Regression Problems¶
Carlos Adrián Palmieri Álvarez - A01635776¶
Life Expectancy¶
import pandas as pd
# Loading the data
df = pd.read_csv('../data/raw/life_expectancy_data.csv')
df
| Country | Year | Status | Life expectancy | Adult Mortality | infant deaths | Alcohol | percentage expenditure | Hepatitis B | Measles | ... | Polio | Total expenditure | Diphtheria | HIV/AIDS | GDP | Population | thinness 1-19 years | thinness 5-9 years | Income composition of resources | Schooling | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 2015 | Developing | 65.0 | 263.0 | 62 | 0.01 | 71.279624 | 65.0 | 1154 | ... | 6.0 | 8.16 | 65.0 | 0.1 | 584.259210 | 33736494.0 | 17.2 | 17.3 | 0.479 | 10.1 |
| 1 | Afghanistan | 2014 | Developing | 59.9 | 271.0 | 64 | 0.01 | 73.523582 | 62.0 | 492 | ... | 58.0 | 8.18 | 62.0 | 0.1 | 612.696514 | 327582.0 | 17.5 | 17.5 | 0.476 | 10.0 |
| 2 | Afghanistan | 2013 | Developing | 59.9 | 268.0 | 66 | 0.01 | 73.219243 | 64.0 | 430 | ... | 62.0 | 8.13 | 64.0 | 0.1 | 631.744976 | 31731688.0 | 17.7 | 17.7 | 0.470 | 9.9 |
| 3 | Afghanistan | 2012 | Developing | 59.5 | 272.0 | 69 | 0.01 | 78.184215 | 67.0 | 2787 | ... | 67.0 | 8.52 | 67.0 | 0.1 | 669.959000 | 3696958.0 | 17.9 | 18.0 | 0.463 | 9.8 |
| 4 | Afghanistan | 2011 | Developing | 59.2 | 275.0 | 71 | 0.01 | 7.097109 | 68.0 | 3013 | ... | 68.0 | 7.87 | 68.0 | 0.1 | 63.537231 | 2978599.0 | 18.2 | 18.2 | 0.454 | 9.5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2933 | Zimbabwe | 2004 | Developing | 44.3 | 723.0 | 27 | 4.36 | 0.000000 | 68.0 | 31 | ... | 67.0 | 7.13 | 65.0 | 33.6 | 454.366654 | 12777511.0 | 9.4 | 9.4 | 0.407 | 9.2 |
| 2934 | Zimbabwe | 2003 | Developing | 44.5 | 715.0 | 26 | 4.06 | 0.000000 | 7.0 | 998 | ... | 7.0 | 6.52 | 68.0 | 36.7 | 453.351155 | 12633897.0 | 9.8 | 9.9 | 0.418 | 9.5 |
| 2935 | Zimbabwe | 2002 | Developing | 44.8 | 73.0 | 25 | 4.43 | 0.000000 | 73.0 | 304 | ... | 73.0 | 6.53 | 71.0 | 39.8 | 57.348340 | 125525.0 | 1.2 | 1.3 | 0.427 | 10.0 |
| 2936 | Zimbabwe | 2001 | Developing | 45.3 | 686.0 | 25 | 1.72 | 0.000000 | 76.0 | 529 | ... | 76.0 | 6.16 | 75.0 | 42.1 | 548.587312 | 12366165.0 | 1.6 | 1.7 | 0.427 | 9.8 |
| 2937 | Zimbabwe | 2000 | Developing | 46.0 | 665.0 | 24 | 1.68 | 0.000000 | 79.0 | 1483 | ... | 78.0 | 7.10 | 78.0 | 43.5 | 547.358878 | 12222251.0 | 11.0 | 11.2 | 0.434 | 9.8 |
2938 rows × 22 columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2938 entries, 0 to 2937
Data columns (total 22 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Country                          2938 non-null   object 
 1   Year                             2938 non-null   int64  
 2   Status                           2938 non-null   object 
 3   Life expectancy                  2928 non-null   float64
 4   Adult Mortality                  2928 non-null   float64
 5   infant deaths                    2938 non-null   int64  
 6   Alcohol                          2744 non-null   float64
 7   percentage expenditure           2938 non-null   float64
 8   Hepatitis B                      2385 non-null   float64
 9   Measles                          2938 non-null   int64  
 10  BMI                              2904 non-null   float64
 11  under-five deaths                2938 non-null   int64  
 12  Polio                            2919 non-null   float64
 13  Total expenditure                2712 non-null   float64
 14  Diphtheria                       2919 non-null   float64
 15  HIV/AIDS                         2938 non-null   float64
 16  GDP                              2490 non-null   float64
 17  Population                       2286 non-null   float64
 18  thinness 1-19 years              2904 non-null   float64
 19  thinness 5-9 years               2904 non-null   float64
 20  Income composition of resources  2771 non-null   float64
 21  Schooling                        2775 non-null   float64
dtypes: float64(16), int64(4), object(2)
memory usage: 505.1+ KB
df.columns
Index(['Country', 'Year', 'Status', 'Life expectancy ', 'Adult Mortality',
'infant deaths', 'Alcohol', 'percentage expenditure', 'Hepatitis B',
'Measles ', ' BMI ', 'under-five deaths ', 'Polio', 'Total expenditure',
'Diphtheria ', ' HIV/AIDS', 'GDP', 'Population',
' thinness 1-19 years', ' thinness 5-9 years',
'Income composition of resources', 'Schooling'],
dtype='object')
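Note that several column names carry stray leading or trailing spaces ('Life expectancy ', ' BMI ', ' HIV/AIDS', etc.), which is why the exact padded strings must be used everywhere below. Normalizing the names once would avoid hard-to-spot KeyErrors; a minimal optional sketch (the rest of this notebook keeps the original padded names, and `df_demo` here is a small hypothetical frame standing in for the loaded CSV):

```python
import pandas as pd

# Hypothetical frame whose column names mimic the padded CSV header
df_demo = pd.DataFrame({'Life expectancy ': [65.0], ' BMI ': [19.1]})

# Strip leading/trailing whitespace from every column name
df_demo.columns = df_demo.columns.str.strip()
print(list(df_demo.columns))
```

After this, columns can be referenced as plain `'Life expectancy'` and `'BMI'`.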
# Dropping the columns that will not be used in the analysis
df.drop(columns=['Status','Year','Country','percentage expenditure', 'under-five deaths ', ' HIV/AIDS', ' thinness 5-9 years'], inplace=True)
df
| Life expectancy | Adult Mortality | infant deaths | Alcohol | Hepatitis B | Measles | BMI | Polio | Total expenditure | Diphtheria | GDP | Population | thinness 1-19 years | Income composition of resources | Schooling | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 65.0 | 263.0 | 62 | 0.01 | 65.0 | 1154 | 19.1 | 6.0 | 8.16 | 65.0 | 584.259210 | 33736494.0 | 17.2 | 0.479 | 10.1 |
| 1 | 59.9 | 271.0 | 64 | 0.01 | 62.0 | 492 | 18.6 | 58.0 | 8.18 | 62.0 | 612.696514 | 327582.0 | 17.5 | 0.476 | 10.0 |
| 2 | 59.9 | 268.0 | 66 | 0.01 | 64.0 | 430 | 18.1 | 62.0 | 8.13 | 64.0 | 631.744976 | 31731688.0 | 17.7 | 0.470 | 9.9 |
| 3 | 59.5 | 272.0 | 69 | 0.01 | 67.0 | 2787 | 17.6 | 67.0 | 8.52 | 67.0 | 669.959000 | 3696958.0 | 17.9 | 0.463 | 9.8 |
| 4 | 59.2 | 275.0 | 71 | 0.01 | 68.0 | 3013 | 17.2 | 68.0 | 7.87 | 68.0 | 63.537231 | 2978599.0 | 18.2 | 0.454 | 9.5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2933 | 44.3 | 723.0 | 27 | 4.36 | 68.0 | 31 | 27.1 | 67.0 | 7.13 | 65.0 | 454.366654 | 12777511.0 | 9.4 | 0.407 | 9.2 |
| 2934 | 44.5 | 715.0 | 26 | 4.06 | 7.0 | 998 | 26.7 | 7.0 | 6.52 | 68.0 | 453.351155 | 12633897.0 | 9.8 | 0.418 | 9.5 |
| 2935 | 44.8 | 73.0 | 25 | 4.43 | 73.0 | 304 | 26.3 | 73.0 | 6.53 | 71.0 | 57.348340 | 125525.0 | 1.2 | 0.427 | 10.0 |
| 2936 | 45.3 | 686.0 | 25 | 1.72 | 76.0 | 529 | 25.9 | 76.0 | 6.16 | 75.0 | 548.587312 | 12366165.0 | 1.6 | 0.427 | 9.8 |
| 2937 | 46.0 | 665.0 | 24 | 1.68 | 79.0 | 1483 | 25.5 | 78.0 | 7.10 | 78.0 | 547.358878 | 12222251.0 | 11.0 | 0.434 | 9.8 |
2938 rows × 15 columns
df.describe()
| Life expectancy | Adult Mortality | infant deaths | Alcohol | Hepatitis B | Measles | BMI | Polio | Total expenditure | Diphtheria | GDP | Population | thinness 1-19 years | Income composition of resources | Schooling | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2928.000000 | 2928.000000 | 2938.000000 | 2744.000000 | 2385.000000 | 2938.000000 | 2904.000000 | 2919.000000 | 2712.00000 | 2919.000000 | 2490.000000 | 2.286000e+03 | 2904.000000 | 2771.000000 | 2775.000000 |
| mean | 69.224932 | 164.796448 | 30.303948 | 4.602861 | 80.940461 | 2419.592240 | 38.321247 | 82.550188 | 5.93819 | 82.324084 | 7483.158469 | 1.275338e+07 | 4.839704 | 0.627551 | 11.992793 |
| std | 9.523867 | 124.292079 | 117.926501 | 4.052413 | 25.070016 | 11467.272489 | 20.044034 | 23.428046 | 2.49832 | 23.716912 | 14270.169342 | 6.101210e+07 | 4.420195 | 0.210904 | 3.358920 |
| min | 36.300000 | 1.000000 | 0.000000 | 0.010000 | 1.000000 | 0.000000 | 1.000000 | 3.000000 | 0.37000 | 2.000000 | 1.681350 | 3.400000e+01 | 0.100000 | 0.000000 | 0.000000 |
| 25% | 63.100000 | 74.000000 | 0.000000 | 0.877500 | 77.000000 | 0.000000 | 19.300000 | 78.000000 | 4.26000 | 78.000000 | 463.935626 | 1.957932e+05 | 1.600000 | 0.493000 | 10.100000 |
| 50% | 72.100000 | 144.000000 | 3.000000 | 3.755000 | 92.000000 | 17.000000 | 43.500000 | 93.000000 | 5.75500 | 93.000000 | 1766.947595 | 1.386542e+06 | 3.300000 | 0.677000 | 12.300000 |
| 75% | 75.700000 | 228.000000 | 22.000000 | 7.702500 | 97.000000 | 360.250000 | 56.200000 | 97.000000 | 7.49250 | 97.000000 | 5910.806335 | 7.420359e+06 | 7.200000 | 0.779000 | 14.300000 |
| max | 89.000000 | 723.000000 | 1800.000000 | 17.870000 | 99.000000 | 212183.000000 | 87.300000 | 99.000000 | 17.60000 | 99.000000 | 119172.741800 | 1.293859e+09 | 27.700000 | 0.948000 | 20.700000 |
df.shape
(2938, 15)
# Checking for null values
df.isnull().sum()
Life expectancy                     10
Adult Mortality                     10
infant deaths                        0
Alcohol                            194
Hepatitis B                        553
Measles                              0
BMI                                 34
Polio                               19
Total expenditure                  226
Diphtheria                          19
GDP                                448
Population                         652
thinness 1-19 years                 34
Income composition of resources    167
Schooling                          163
dtype: int64
# Inspecting the rows that contain missing values
df[df.isnull().any(axis=1)]
| Life expectancy | Adult Mortality | infant deaths | Alcohol | Hepatitis B | Measles | BMI | Polio | Total expenditure | Diphtheria | GDP | Population | thinness 1-19 years | Income composition of resources | Schooling | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 32 | 75.6 | 19.0 | 21 | NaN | 95.0 | 63 | 59.5 | 95.0 | NaN | 95.0 | 4132.762920 | 39871528.0 | 6.0 | 0.743 | 14.4 |
| 44 | 71.7 | 146.0 | 20 | 0.34 | NaN | 15374 | 47.0 | 87.0 | 3.60 | 87.0 | 294.335560 | 3243514.0 | 6.3 | 0.663 | 11.5 |
| 45 | 71.6 | 145.0 | 20 | 0.36 | NaN | 5862 | 46.1 | 86.0 | 3.73 | 86.0 | 1774.336730 | 3199546.0 | 6.3 | 0.653 | 11.1 |
| 46 | 71.4 | 145.0 | 20 | 0.23 | NaN | 2686 | 45.3 | 89.0 | 3.84 | 89.0 | 1732.857979 | 31592153.0 | 6.4 | 0.644 | 10.9 |
| 47 | 71.3 | 145.0 | 21 | 0.25 | NaN | 0 | 44.4 | 86.0 | 3.49 | 86.0 | 1757.177970 | 3118366.0 | 6.5 | 0.636 | 10.7 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2918 | 46.4 | 64.0 | 39 | 2.33 | NaN | 881 | 17.6 | 85.0 | 8.18 | 83.0 | 429.158343 | 11421984.0 | 7.3 | 0.443 | 10.2 |
| 2919 | 45.5 | 69.0 | 41 | 2.44 | NaN | 25036 | 17.3 | 85.0 | 6.93 | 84.0 | 377.135244 | 111249.0 | 7.4 | 0.433 | 10.0 |
| 2920 | 44.6 | 611.0 | 43 | 2.61 | NaN | 16997 | 17.1 | 86.0 | 6.56 | 85.0 | 378.273624 | 1824125.0 | 7.4 | 0.424 | 9.8 |
| 2921 | 43.8 | 614.0 | 44 | 2.62 | NaN | 30930 | 16.8 | 85.0 | 7.16 | 85.0 | 341.955625 | 1531221.0 | 7.5 | 0.418 | 9.6 |
| 2922 | 67.0 | 336.0 | 22 | NaN | 87.0 | 0 | 31.8 | 88.0 | NaN | 87.0 | 118.693830 | 15777451.0 | 5.6 | 0.507 | 10.3 |
1289 rows × 15 columns
# Identifying what kind of missingness is present: MCAR, MAR, or MNAR
# Correlation test on the rows that contain missing values
# Saving the records with missing values
missing_values = df[df.isnull().any(axis=1)]
missing_values
# Checking correlations among the rows with missing values
missing_values.corr()
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
sns.heatmap(missing_values.corr(), annot=True)
plt.show()
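One caveat: `missing_values.corr()` correlates the *observed values* in rows that happen to contain a NaN, not the missingness pattern itself. A common check for distinguishing MCAR from MAR correlates the missingness *indicators* (1 = missing, 0 = observed) with the observed variables. A sketch on hypothetical data (`a`, `b`, and `demo` are illustrative names, with `a` made missing precisely when `b` is large, a MAR-like pattern):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'a' is missing exactly when 'b' is large (MAR-like)
rng = np.random.default_rng(0)
b = rng.normal(size=200)
a = rng.normal(size=200)
a[b > 1] = np.nan
demo = pd.DataFrame({'a': a, 'b': b})

# 1 where a value is missing, 0 where it is observed
indicators = demo.isnull().astype(int)
# Correlate the missingness of 'a' with the observed values of 'b';
# a clearly nonzero correlation suggests the data are not MCAR
print(indicators['a'].corr(demo['b']))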
1. Plot each predictor variable against the response variable assigned to your student ID.¶
# Plotting each variable against the "Life expectancy " column
import matplotlib.pyplot as plt
import seaborn as sns
# Scatter plots arranged in a 4x4 grid, omitting Life expectancy vs. itself
predictors = [c for c in df.columns if c != 'Life expectancy ']
fig, axs = plt.subplots(4, 4, figsize=(20, 20))
for i, column in enumerate(predictors):
    sns.scatterplot(data=df, x=column, y='Life expectancy ', ax=axs[i // 4, i % 4])
2. Implement the direct formula for computing the coefficients of a linear regression model, and use it to obtain the model for the response variable and predictor variables assigned to your student ID.¶
import numpy as np
import numpy.linalg as ln
# Preparing the data for multiple linear regression
df_clean = df.dropna()
X = df_clean.drop(columns='Life expectancy ')
y = df_clean['Life expectancy ']
X_np = X.to_numpy()
y_np = y.to_numpy()
X_np.shape, y_np.shape
((1649, 14), (1649,))
# Checking whether the matrices contain missing values
np.isnan(X_np).sum(), np.isnan(y_np).sum()
(np.int64(0), np.int64(0))
print(X_np)
[[2.63e+02 6.20e+01 1.00e-02 ... 1.72e+01 4.79e-01 1.01e+01] [2.71e+02 6.40e+01 1.00e-02 ... 1.75e+01 4.76e-01 1.00e+01] [2.68e+02 6.60e+01 1.00e-02 ... 1.77e+01 4.70e-01 9.90e+00] ... [7.30e+01 2.50e+01 4.43e+00 ... 1.20e+00 4.27e-01 1.00e+01] [6.86e+02 2.50e+01 1.72e+00 ... 1.60e+00 4.27e-01 9.80e+00] [6.65e+02 2.40e+01 1.68e+00 ... 1.10e+01 4.34e-01 9.80e+00]]
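The "direct formula" the task statement refers to is the closed-form normal equation, β = (XᵀX)⁻¹Xᵀy; the model below is instead fitted by gradient descent. For comparison, a minimal sketch of the closed form (`fit_normal_equation` is an illustrative name; `np.linalg.lstsq` is used rather than an explicit inverse because it is more numerically stable, and a column of ones is prepended for the intercept):

```python
import numpy as np

def fit_normal_equation(X, y):
    # Prepend a column of ones so the model has an intercept term
    X1 = np.column_stack([np.ones(len(X)), X])
    # Solve the least-squares problem min ||X1 @ beta - y||^2 directly
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta

# Tiny synthetic check: y = 2 + 3x should recover beta ≈ [2, 3]
x = np.arange(10, dtype=float).reshape(-1, 1)
y = 2 + 3 * x.ravel()
print(fit_normal_equation(x, y))
```

The same call with `X_np` and `y_np` would give the coefficients in the original (unstandardized) units.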
# Gradient of the mean squared error
def grad(X, y, beta):
    n = len(y)
    y_pred = X @ beta
    res = y - y_pred
    tmp = res[:, np.newaxis] * X  # broadcast the residuals across the columns of X
    return -2/n * tmp.sum(axis=0)
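Before optimizing, it is worth confirming the analytic gradient against a finite-difference approximation of the MSE loss. A quick self-contained sketch on random data (`loss` and the test arrays are illustrative; `grad` is the same function defined above):

```python
import numpy as np

def grad(X, y, beta):
    # Gradient of the MSE: -2/n * X^T (y - X beta)
    n = len(y)
    res = y - X @ beta
    return -2 / n * (res[:, np.newaxis] * X).sum(axis=0)

def loss(X, y, beta):
    return np.mean((y - X @ beta) ** 2)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)
beta = rng.normal(size=3)

# Central finite differences, one coordinate at a time
eps = 1e-6
numeric = np.array([
    (loss(X, y, beta + eps * e) - loss(X, y, beta - eps * e)) / (2 * eps)
    for e in np.eye(3)
])
# The two gradients should agree to numerical precision
print(np.max(np.abs(numeric - grad(X, y, beta))))
```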
# Multiple linear regression fitted by gradient descent
from sklearn.preprocessing import StandardScaler
def fit_model(X, y, alpha=.005, maxit=10000):
    # Standardize predictors and response
    scaler_X = StandardScaler()
    X_scaled = scaler_X.fit_transform(X)
    scaler_y = StandardScaler()
    y_scaled = scaler_y.fit_transform(y.reshape(-1, 1)).flatten()
    # Number of predictors
    n = X.shape[1]
    # Initialize beta uniformly in [-1, 1)
    beta = 2*np.random.rand(n) - 1.0
    # Gradient descent until the gradient norm is small or maxit is reached
    it = 0
    while (np.linalg.norm(grad(X_scaled, y_scaled, beta)) > 1e-4) and (it < maxit):
        beta = beta - alpha * grad(X_scaled, y_scaled, beta)
        # Guard against divergence
        if np.any(np.abs(beta) > 1e10):
            print(f"Warning: beta values started to blow up at iteration {it}")
            break
        # Guard against NaN in beta
        if np.any(np.isnan(beta)):
            print(f"NaN detected at iteration {it}")
            break
        it = it + 1
    return beta, scaler_X, scaler_y
# Fitting the multiple linear regression model (fit_model returns beta plus both scalers)
beta, scaler_X, scaler_y = fit_model(X_np, y_np)
print("The coefficients are: ", beta)
The coefficients are:  [-0.41557125 -0.02833773 -0.07918307 -0.00251173  0.00986548  0.08712028
  0.02295472 -0.00678517  0.05081052  0.08809878  0.0220853  -0.04523112
  0.23104851  0.28071011]
3. Evaluate your model with k-fold cross-validation, computing R², MSE, and MAE.¶
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import LeavePOut
from sklearn.model_selection import LeaveOneOut
import warnings
from sklearn.exceptions import UndefinedMetricWarning
# Suppress UndefinedMetricWarning messages
warnings.simplefilter(action='ignore', category=UndefinedMetricWarning)
# Prediction helper
def predict(X, beta, scaler_X, scaler_y):
    X_scaled = scaler_X.transform(X)  # scale new inputs with the training scaler
    y_pred_scaled = X_scaled @ beta
    y_pred = scaler_y.inverse_transform(y_pred_scaled.reshape(-1, 1)).flatten()  # back to original units
    return y_pred
# Evaluating with k-fold cross-validation
def validacion_cruzada(X, y, k):
    kf = KFold(n_splits=k, shuffle=True)
    mse_cv = []
    mae_cv = []
    r2_cv = []
    for train_index, test_index in kf.split(X):
        # Training phase
        X_train, y_train = X[train_index, :], y[train_index]
        beta_cv, scaler_X, scaler_y = fit_model(X_train, y_train)
        # Test phase
        X_test, y_test = X[test_index, :], y[test_index]
        y_pred = predict(X_test, beta_cv, scaler_X, scaler_y)
        # Compute MSE, MAE, and R^2 for this fold
        mse_i = mean_squared_error(y_test, y_pred)
        print('MSE = ', mse_i)
        mse_cv.append(mse_i)
        mae_i = mean_absolute_error(y_test, y_pred)
        print('MAE = ', mae_i)
        mae_cv.append(mae_i)
        r2_i = r2_score(y_test, y_pred)
        print('R^2 = ', r2_i)
        r2_cv.append(r2_i)
    print('Mean MSE:', np.average(mse_cv), ' Mean MAE:', np.average(mae_cv), ' Mean R^2:', np.average(r2_cv))
# Checking whether X_np and y_np contain null values
#np.isnan(X_np).sum(), np.isnan(y_np).sum()
validacion_cruzada(X_np, y_np, 5)
MSE =  20.361671522748033
MAE =  3.246274965738945
R^2 =  0.7528717589438239
MSE =  22.221650655769384
MAE =  3.2581433266346025
R^2 =  0.7270027093880594
MSE =  15.58103223835667
MAE =  3.003647878204671
R^2 =  0.7221703681355872
MSE =  17.27218951181445
MAE =  2.979266206194986
R^2 =  0.7693544342873956
MSE =  18.51895080880482
MAE =  3.1379932391455836
R^2 =  0.7949390405075046
Mean MSE: 18.79109894749867  Mean MAE: 3.125065123183757  Mean R^2: 0.7532676622524741
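As a sanity check, the same k-fold evaluation can be run against scikit-learn's `LinearRegression` with `cross_validate`; on this dataset it should land in the same neighborhood (R² around 0.75). A sketch on synthetic stand-in data, since the CSV is not bundled here (with the notebook's `X_np`/`y_np` the `cross_validate` call is identical):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Hypothetical stand-in data: a known linear signal plus small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
w = np.array([1.0, -2.0, 0.5, 1.5, -1.0])
y = X @ w + rng.normal(scale=0.1, size=200)

# 5-fold CV with the same three metrics used above
scores = cross_validate(
    LinearRegression(), X, y, cv=5,
    scoring=('r2', 'neg_mean_squared_error', 'neg_mean_absolute_error'),
)
print(scores['test_r2'].mean())
```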
4. Use Monte Carlo cross-validation with 1000 iterations to obtain histograms of R², MSE, and MAE.¶
def monte_carlo_cross_validation(X, y, n_iterations, test_size=0.2):
    mse_cv = []
    mae_cv = []
    r2_cv = []
    for i in range(n_iterations):
        # Split the data into random training and test sets
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size)
        # Fit the model on the training set
        beta, scaler_X, scaler_y = fit_model(X_train, y_train)
        # Predict on the test set
        y_pred = predict(X_test, beta, scaler_X, scaler_y)
        # Compute MSE, MAE, and R^2 for this iteration
        mse_cv.append(mean_squared_error(y_test, y_pred))
        mae_cv.append(mean_absolute_error(y_test, y_pred))
        r2_cv.append(r2_score(y_test, y_pred))
        # Progress counter, overwritten in place rather than printed once per iteration
        print(f'Iteration {i + 1}/{n_iterations}', end='\r')
    print('Mean MSE:', np.average(mse_cv), ' Mean MAE:', np.average(mae_cv), ' Mean R^2:', np.average(r2_cv))
    return np.average(mse_cv), np.average(mae_cv), np.average(r2_cv), mse_cv, mae_cv, r2_cv
# Evaluating with the Monte Carlo method
msve_avg, mae_avg, r2_avg, msve, mae, r2 = monte_carlo_cross_validation(X_np, y_np, n_iterations=1000)
print(msve_avg, mae_avg, r2_avg)
Mean MSE: 18.734391304381823  Mean MAE: 3.118279494404135  Mean R^2: 0.7564712964804281
18.734391304381823 3.118279494404135 0.7564712964804281
MC_mse = msve
MC_mae = mae
MC_r2 = r2
MC_mse_avg = msve_avg
MC_mae_avg = mae_avg
MC_r2_avg = r2_avg
print(f'MSE: {MC_mse_avg}, MAE: {MC_mae_avg}, R^2: {MC_r2_avg}')
MSE: 18.734391304381823, MAE: 3.118279494404135, R^2: 0.7564712964804281
# MSE and MAE histograms for Monte Carlo CV
plt.figure(figsize=(20, 5))
plt.subplot(1, 2, 1)
plt.hist(MC_mse, bins=30, color='blue', edgecolor='black', rwidth=0.70, density=True)
plt.title('MSE histogram, Monte Carlo')
plt.xlabel('MSE')
plt.ylabel('Frequency')
plt.subplot(1, 2, 2)
plt.hist(MC_mae, bins=30, color='green', edgecolor='black', rwidth=0.70, density=True)
plt.title('MAE histogram, Monte Carlo')
plt.xlabel('MAE')
plt.ylabel('Frequency')
plt.show()
5. Use the cross-validation method assigned to your student ID to show the MSE and MAE histograms. Are the histograms different from those obtained with the Monte Carlo method?¶
Leave-P-Out (LpOCV) validation method with P = 2¶
def leave_p_out_cross_validation(X, y, p):
    # Enumerate the required partitions
    lpo = LeavePOut(p)
    mse_cv = []
    mae_cv = []
    r2_cv = []
    i = 0
    it = lpo.get_n_splits(X)
    print(f'Number of partitions: {it}')
    for train_index, test_index in lpo.split(X):
        # Training phase
        X_train, y_train = X[train_index, :], y[train_index]
        beta_cv, scaler_X, scaler_y = fit_model(X_train, y_train)
        # Test phase
        X_test, y_test = X[test_index, :], y[test_index]
        y_pred = predict(X_test, beta_cv, scaler_X, scaler_y)
        # Compute MSE, MAE, and R^2 for this split
        mse_cv.append(mean_squared_error(y_test, y_pred))
        mae_cv.append(mean_absolute_error(y_test, y_pred))
        r2_cv.append(r2_score(y_test, y_pred))
        i += 1
        # Progress percentage, overwritten in place
        percent = i * 100 / it
        print(f'Progress: ({percent}%)', end='\r')
    print('Mean MSE:', np.average(mse_cv), ' Mean MAE:', np.average(mae_cv), ' Mean R^2:', np.average(r2_cv))
Execution time:
leave_p_out_cross_validation(X_np, y_np, p=2)
Number of partitions: 1358776
Progress: (4.950116869888782%)
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
Cell In[1686], line 1
----> 1 leave_p_out_cross_validation(X_np, y_np, p=2)

Cell In[1685], line 17, in leave_p_out_cross_validation(X, y, p)
---> 17 beta_cv, scaler_X, scaler_y = fit_model(X_train, y_train)

Cell In[1665], line 23, in fit_model(X, y, alpha, maxit)
---> 23 beta = beta - alpha * grad(X_scaled, y_scaled, beta)

Cell In[1664], line 7, in grad(X, y, beta)
----> 7 return -2/n * tmp.sum(axis=0)

KeyboardInterrupt:
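The interrupt is no surprise: Leave-P-Out with p = 2 trains one model per pair of held-out rows, so the number of fits grows combinatorially. The partition count printed above can be reproduced directly:

```python
import math

# C(1649, 2): one split per distinct pair of held-out samples
n_samples, p = 1649, 2
n_splits = math.comb(n_samples, p)
print(n_splits)  # 1358776, matching LeavePOut(2).get_n_splits(X)
```

At even a few gradient-descent fits per second, 1.36 million fits amounts to days of computation, which is why the notebook falls back to LOOCV below.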
Leave-One-Out (LOOCV) validation method¶
# LOOCV instead, given how long LpOCV takes
def leave_one_out_cross_validation(X, y):
    # Enumerate the required partitions
    loo = LeaveOneOut()
    mse_cv = []
    mae_cv = []
    r2_cv = []
    i = 0
    it = loo.get_n_splits(X)
    print(f'Number of partitions: {it}')
    for train_index, test_index in loo.split(X):
        # Training phase
        X_train, y_train = X[train_index, :], y[train_index]
        beta_cv, scaler_X, scaler_y = fit_model(X_train, y_train)
        # Test phase (a single held-out sample)
        X_test, y_test = X[test_index, :], y[test_index]
        y_pred = predict(X_test, beta_cv, scaler_X, scaler_y)
        # Compute MSE and MAE; r2_score is undefined on a single sample and returns nan
        mse_cv.append(mean_squared_error(y_test, y_pred))
        mae_cv.append(mean_absolute_error(y_test, y_pred))
        r2_cv.append(r2_score(y_test, y_pred))
        i += 1
        # Progress percentage, overwritten in place
        percent = round(i * 100 / it, 2)
        print(f'Progress: ({percent}%)', end='\r')
    print('Mean MSE:', np.average(mse_cv), ' Mean MAE:', np.average(mae_cv), ' Mean R^2:', np.average(r2_cv))
    return np.average(mse_cv), np.average(mae_cv), np.average(r2_cv), mse_cv, mae_cv, r2_cv
LOOCV_msve_avg, LOOCV_mae_avg, LOOCV_r2_avg, LOOCV_msve, LOOCV_mae, LOOCV_r2 = leave_one_out_cross_validation(X_np, y_np)
Number of partitions: 1649
Mean MSE: 18.61551424388183  Mean MAE: 3.110112329306491  Mean R^2: nan
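The NaN R² is expected: `r2_score` is undefined on a single-sample test set (the variance of `y_test` is zero), which is also what the suppressed `UndefinedMetricWarning` referred to. A common workaround pools all the one-out predictions and computes a single global R² at the end. A sketch, where `y_true_all`/`y_pred_all` are hypothetical arrays that would be accumulated inside the LOOCV loop (one element per fold):

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical pooled LOOCV predictions, one held-out point per fold
y_true_all = np.array([70.0, 65.0, 80.0, 55.0])
y_pred_all = np.array([69.0, 66.0, 78.0, 57.0])

# One global R^2 over all held-out points instead of a per-fold average
print(r2_score(y_true_all, y_pred_all))
```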
Histograms¶
# MSE and MAE histograms for LOOCV
plt.figure(figsize=(20, 5))
plt.subplot(1, 2, 1)
plt.hist(LOOCV_msve, bins=30, color='blue', edgecolor='black', rwidth=0.70, density=True)
plt.title('MSE histogram, LOOCV')
plt.xlabel('MSE')
plt.ylabel('Frequency')
plt.xlim(0, 200)  # trim the long right tail so the bulk of the distribution is visible
plt.subplot(1, 2, 2)
plt.hist(LOOCV_mae, bins=30, color='green', edgecolor='black', rwidth=0.70, density=True)
plt.title('MAE histogram, LOOCV')
plt.xlabel('MAE')
plt.ylabel('Frequency')
plt.show()
The Monte Carlo histograms are roughly symmetric for both error measures. The MSE concentrates between 17 and 20, suggesting the mean squared error stays stable across random splits, and the MAE is likewise symmetric and centered around 3, indicating consistent performance with only small variation.

For LOOCV, the MSE is heavily concentrated at the left end (roughly 0 to 25) with a long right tail: most single-sample errors are small, but a few iterations produce very large errors, reflecting the high variance inherent in single-observation test sets. The MAE histogram is similar, dominated by small values, though occasional very large errors still appear.
6. Add columns to the dataset representing the squares of the predictor variables (e.g., X1², X13²) and the products of pairs of variables (e.g., X1·X2, X3·X4). Repeat steps 1, 2, and 3 with this new dataset.¶
df_clean
| Life expectancy | Adult Mortality | infant deaths | Alcohol | Hepatitis B | Measles | BMI | Polio | Total expenditure | Diphtheria | GDP | Population | thinness 1-19 years | Income composition of resources | Schooling | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 65.0 | 263.0 | 62 | 0.01 | 65.0 | 1154 | 19.1 | 6.0 | 8.16 | 65.0 | 584.259210 | 33736494.0 | 17.2 | 0.479 | 10.1 |
| 1 | 59.9 | 271.0 | 64 | 0.01 | 62.0 | 492 | 18.6 | 58.0 | 8.18 | 62.0 | 612.696514 | 327582.0 | 17.5 | 0.476 | 10.0 |
| 2 | 59.9 | 268.0 | 66 | 0.01 | 64.0 | 430 | 18.1 | 62.0 | 8.13 | 64.0 | 631.744976 | 31731688.0 | 17.7 | 0.470 | 9.9 |
| 3 | 59.5 | 272.0 | 69 | 0.01 | 67.0 | 2787 | 17.6 | 67.0 | 8.52 | 67.0 | 669.959000 | 3696958.0 | 17.9 | 0.463 | 9.8 |
| 4 | 59.2 | 275.0 | 71 | 0.01 | 68.0 | 3013 | 17.2 | 68.0 | 7.87 | 68.0 | 63.537231 | 2978599.0 | 18.2 | 0.454 | 9.5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2933 | 44.3 | 723.0 | 27 | 4.36 | 68.0 | 31 | 27.1 | 67.0 | 7.13 | 65.0 | 454.366654 | 12777511.0 | 9.4 | 0.407 | 9.2 |
| 2934 | 44.5 | 715.0 | 26 | 4.06 | 7.0 | 998 | 26.7 | 7.0 | 6.52 | 68.0 | 453.351155 | 12633897.0 | 9.8 | 0.418 | 9.5 |
| 2935 | 44.8 | 73.0 | 25 | 4.43 | 73.0 | 304 | 26.3 | 73.0 | 6.53 | 71.0 | 57.348340 | 125525.0 | 1.2 | 0.427 | 10.0 |
| 2936 | 45.3 | 686.0 | 25 | 1.72 | 76.0 | 529 | 25.9 | 76.0 | 6.16 | 75.0 | 548.587312 | 12366165.0 | 1.6 | 0.427 | 9.8 |
| 2937 | 46.0 | 665.0 | 24 | 1.68 | 79.0 | 1483 | 25.5 | 78.0 | 7.10 | 78.0 | 547.358878 | 12222251.0 | 11.0 | 0.434 | 9.8 |
1649 rows × 15 columns
df_clean_2 = df_clean.copy()
# Adding the square of each variable except "Life expectancy "
for column in df_clean_2.columns:
    if column != 'Life expectancy ':
        df_clean_2[f'{column}^2'] = df_clean_2[column] ** 2
df_clean_2
df_clean_2
| Life expectancy | Adult Mortality | infant deaths | Alcohol | Hepatitis B | Measles | BMI | Polio | Total expenditure | Diphtheria | ... | Measles ^2 | BMI ^2 | Polio^2 | Total expenditure^2 | Diphtheria ^2 | GDP^2 | Population^2 | thinness 1-19 years^2 | Income composition of resources^2 | Schooling^2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 65.0 | 263.0 | 62 | 0.01 | 65.0 | 1154 | 19.1 | 6.0 | 8.16 | 65.0 | ... | 1331716 | 364.81 | 36.0 | 66.5856 | 4225.0 | 341358.824470 | 1.138151e+15 | 295.84 | 0.229441 | 102.01 |
| 1 | 59.9 | 271.0 | 64 | 0.01 | 62.0 | 492 | 18.6 | 58.0 | 8.18 | 62.0 | ... | 242064 | 345.96 | 3364.0 | 66.9124 | 3844.0 | 375397.018268 | 1.073100e+11 | 306.25 | 0.226576 | 100.00 |
| 2 | 59.9 | 268.0 | 66 | 0.01 | 64.0 | 430 | 18.1 | 62.0 | 8.13 | 64.0 | ... | 184900 | 327.61 | 3844.0 | 66.0969 | 4096.0 | 399101.714701 | 1.006900e+15 | 313.29 | 0.220900 | 98.01 |
| 3 | 59.5 | 272.0 | 69 | 0.01 | 67.0 | 2787 | 17.6 | 67.0 | 8.52 | 67.0 | ... | 7767369 | 309.76 | 4489.0 | 72.5904 | 4489.0 | 448845.061681 | 1.366750e+13 | 320.41 | 0.214369 | 96.04 |
| 4 | 59.2 | 275.0 | 71 | 0.01 | 68.0 | 3013 | 17.2 | 68.0 | 7.87 | 68.0 | ... | 9078169 | 295.84 | 4624.0 | 61.9369 | 4624.0 | 4036.979723 | 8.872052e+12 | 331.24 | 0.206116 | 90.25 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2933 | 44.3 | 723.0 | 27 | 4.36 | 68.0 | 31 | 27.1 | 67.0 | 7.13 | 65.0 | ... | 961 | 734.41 | 4489.0 | 50.8369 | 4225.0 | 206449.056267 | 1.632648e+14 | 88.36 | 0.165649 | 84.64 |
| 2934 | 44.5 | 715.0 | 26 | 4.06 | 7.0 | 998 | 26.7 | 7.0 | 6.52 | 68.0 | ... | 996004 | 712.89 | 49.0 | 42.5104 | 4624.0 | 205527.269921 | 1.596154e+14 | 96.04 | 0.174724 | 90.25 |
| 2935 | 44.8 | 73.0 | 25 | 4.43 | 73.0 | 304 | 26.3 | 73.0 | 6.53 | 71.0 | ... | 92416 | 691.69 | 5329.0 | 42.6409 | 5041.0 | 3288.832101 | 1.575653e+10 | 1.44 | 0.182329 | 100.00 |
| 2936 | 45.3 | 686.0 | 25 | 1.72 | 76.0 | 529 | 25.9 | 76.0 | 6.16 | 75.0 | ... | 279841 | 670.81 | 5776.0 | 37.9456 | 5625.0 | 300948.038887 | 1.529220e+14 | 2.56 | 0.182329 | 96.04 |
| 2937 | 46.0 | 665.0 | 24 | 1.68 | 79.0 | 1483 | 25.5 | 78.0 | 7.10 | 78.0 | ... | 2199289 | 650.25 | 6084.0 | 50.4100 | 6084.0 | 299601.741873 | 1.493834e+14 | 121.00 | 0.188356 | 96.04 |
1649 rows × 29 columns
# Adding the product of each consecutive pair of variables except "Life expectancy ": X1*X2, X3*X4, etc.
life_expectancy = df_clean_2["Life expectancy "]
# Keep only the non-squared columns
non_squared_columns = [col for col in df_clean_2.columns if not col.endswith('^2')]
# Reassign instead of calling drop(inplace=True) on a slice, which would raise SettingWithCopyWarning
df_filtered = df_clean_2[non_squared_columns].drop(columns='Life expectancy ')
products_df = pd.DataFrame()
for i in range(0, len(df_filtered.columns) - 1, 2):
    col1 = df_filtered.columns[i]
    col2 = df_filtered.columns[i + 1]
    products_df[f'{col1}*{col2}'] = df_filtered[col1] * df_filtered[col2]
# df_clean_2 already contains the squared columns, so concatenating them a second
# time would duplicate them; only the product columns need to be appended
final_df = pd.concat([df_clean_2, products_df], axis=1)
final_df
(The original version called `df_filtered.drop(columns='Life expectancy ', inplace=True)` on a DataFrame slice, which triggered pandas' SettingWithCopyWarning; reassigning the result of `drop` avoids it.)
| Life expectancy | Adult Mortality | infant deaths | Alcohol | Hepatitis B | Measles | BMI | Polio | Total expenditure | Diphtheria | ... | thinness 1-19 years^2 | Income composition of resources^2 | Schooling^2 | Adult Mortality*infant deaths | Alcohol*Hepatitis B | Measles * BMI | Polio*Total expenditure | Diphtheria *GDP | Population* thinness 1-19 years | Income composition of resources*Schooling | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 65.0 | 263.0 | 62 | 0.01 | 65.0 | 1154 | 19.1 | 6.0 | 8.16 | 65.0 | ... | 295.84 | 0.229441 | 102.01 | 16306.0 | 0.65 | 22041.4 | 48.96 | 37976.848650 | 580267696.8 | 4.8379 |
| 1 | 59.9 | 271.0 | 64 | 0.01 | 62.0 | 492 | 18.6 | 58.0 | 8.18 | 62.0 | ... | 306.25 | 0.226576 | 100.00 | 17344.0 | 0.62 | 9151.2 | 474.44 | 37987.183868 | 5732685.0 | 4.7600 |
| 2 | 59.9 | 268.0 | 66 | 0.01 | 64.0 | 430 | 18.1 | 62.0 | 8.13 | 64.0 | ... | 313.29 | 0.220900 | 98.01 | 17688.0 | 0.64 | 7783.0 | 504.06 | 40431.678464 | 561650877.6 | 4.6530 |
| 3 | 59.5 | 272.0 | 69 | 0.01 | 67.0 | 2787 | 17.6 | 67.0 | 8.52 | 67.0 | ... | 320.41 | 0.214369 | 96.04 | 18768.0 | 0.67 | 49051.2 | 570.84 | 44887.253000 | 66175548.2 | 4.5374 |
| 4 | 59.2 | 275.0 | 71 | 0.01 | 68.0 | 3013 | 17.2 | 68.0 | 7.87 | 68.0 | ... | 331.24 | 0.206116 | 90.25 | 19525.0 | 0.68 | 51823.6 | 535.16 | 4320.531708 | 54210501.8 | 4.3130 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2933 | 44.3 | 723.0 | 27 | 4.36 | 68.0 | 31 | 27.1 | 67.0 | 7.13 | 65.0 | ... | 88.36 | 0.165649 | 84.64 | 19521.0 | 296.48 | 840.1 | 477.71 | 29533.832510 | 120108603.4 | 3.7444 |
| 2934 | 44.5 | 715.0 | 26 | 4.06 | 7.0 | 998 | 26.7 | 7.0 | 6.52 | 68.0 | ... | 96.04 | 0.174724 | 90.25 | 18590.0 | 28.42 | 26646.6 | 45.64 | 30827.878554 | 123812190.6 | 3.9710 |
| 2935 | 44.8 | 73.0 | 25 | 4.43 | 73.0 | 304 | 26.3 | 73.0 | 6.53 | 71.0 | ... | 1.44 | 0.182329 | 100.00 | 1825.0 | 323.39 | 7995.2 | 476.69 | 4071.732140 | 150630.0 | 4.2700 |
| 2936 | 45.3 | 686.0 | 25 | 1.72 | 76.0 | 529 | 25.9 | 76.0 | 6.16 | 75.0 | ... | 2.56 | 0.182329 | 96.04 | 17150.0 | 130.72 | 13701.1 | 468.16 | 41144.048400 | 19785864.0 | 4.1846 |
| 2937 | 46.0 | 665.0 | 24 | 1.68 | 79.0 | 1483 | 25.5 | 78.0 | 7.10 | 78.0 | ... | 121.00 | 0.188356 | 96.04 | 15960.0 | 132.72 | 37816.5 | 553.80 | 42693.992523 | 134444761.0 | 4.2532 |
1649 rows × 50 columns
final_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1649 entries, 0 to 2937
Data columns (total 50 columns):
 #   Column                                      Non-Null Count  Dtype
---  ------                                      --------------  -----
 0   Life expectancy                             1649 non-null   float64
 1   Adult Mortality                             1649 non-null   float64
 2   infant deaths                               1649 non-null   int64
 3   Alcohol                                     1649 non-null   float64
 4   Hepatitis B                                 1649 non-null   float64
 5   Measles                                     1649 non-null   int64
 6   BMI                                         1649 non-null   float64
 7   Polio                                       1649 non-null   float64
 8   Total expenditure                           1649 non-null   float64
 9   Diphtheria                                  1649 non-null   float64
 10  GDP                                         1649 non-null   float64
 11  Population                                  1649 non-null   float64
 12  thinness 1-19 years                         1649 non-null   float64
 13  Income composition of resources             1649 non-null   float64
 14  Schooling                                   1649 non-null   float64
 15  Adult Mortality^2                           1649 non-null   float64
 16  infant deaths^2                             1649 non-null   int64
 17  Alcohol^2                                   1649 non-null   float64
 18  Hepatitis B^2                               1649 non-null   float64
 19  Measles ^2                                  1649 non-null   int64
 20  BMI ^2                                      1649 non-null   float64
 21  Polio^2                                     1649 non-null   float64
 22  Total expenditure^2                         1649 non-null   float64
 23  Diphtheria ^2                               1649 non-null   float64
 24  GDP^2                                       1649 non-null   float64
 25  Population^2                                1649 non-null   float64
 26  thinness 1-19 years^2                       1649 non-null   float64
 27  Income composition of resources^2           1649 non-null   float64
 28  Schooling^2                                 1649 non-null   float64
 29  Adult Mortality^2                           1649 non-null   float64
 30  infant deaths^2                             1649 non-null   int64
 31  Alcohol^2                                   1649 non-null   float64
 32  Hepatitis B^2                               1649 non-null   float64
 33  Measles ^2                                  1649 non-null   int64
 34  BMI ^2                                      1649 non-null   float64
 35  Polio^2                                     1649 non-null   float64
 36  Total expenditure^2                         1649 non-null   float64
 37  Diphtheria ^2                               1649 non-null   float64
 38  GDP^2                                       1649 non-null   float64
 39  Population^2                                1649 non-null   float64
 40  thinness 1-19 years^2                       1649 non-null   float64
 41  Income composition of resources^2           1649 non-null   float64
 42  Schooling^2                                 1649 non-null   float64
 43  Adult Mortality*infant deaths               1649 non-null   float64
 44  Alcohol*Hepatitis B                         1649 non-null   float64
 45  Measles * BMI                               1649 non-null   float64
 46  Polio*Total expenditure                     1649 non-null   float64
 47  Diphtheria *GDP                             1649 non-null   float64
 48  Population* thinness 1-19 years             1649 non-null   float64
 49  Income composition of resources*Schooling   1649 non-null   float64
dtypes: float64(44), int64(6)
memory usage: 657.0 KB
# Inspect the squared-term block (columns 15-42), which appears twice
final_df.iloc[:, 15:43].head(5)
| Adult Mortality^2 | infant deaths^2 | Alcohol^2 | Hepatitis B^2 | Measles ^2 | BMI ^2 | Polio^2 | Total expenditure^2 | Diphtheria ^2 | GDP^2 | ... | Measles ^2 | BMI ^2 | Polio^2 | Total expenditure^2 | Diphtheria ^2 | GDP^2 | Population^2 | thinness 1-19 years^2 | Income composition of resources^2 | Schooling^2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 69169.0 | 3844 | 0.0001 | 4225.0 | 1331716 | 364.81 | 36.0 | 66.5856 | 4225.0 | 341358.824470 | ... | 1331716 | 364.81 | 36.0 | 66.5856 | 4225.0 | 341358.824470 | 1.138151e+15 | 295.84 | 0.229441 | 102.01 |
| 1 | 73441.0 | 4096 | 0.0001 | 3844.0 | 242064 | 345.96 | 3364.0 | 66.9124 | 3844.0 | 375397.018268 | ... | 242064 | 345.96 | 3364.0 | 66.9124 | 3844.0 | 375397.018268 | 1.073100e+11 | 306.25 | 0.226576 | 100.00 |
| 2 | 71824.0 | 4356 | 0.0001 | 4096.0 | 184900 | 327.61 | 3844.0 | 66.0969 | 4096.0 | 399101.714701 | ... | 184900 | 327.61 | 3844.0 | 66.0969 | 4096.0 | 399101.714701 | 1.006900e+15 | 313.29 | 0.220900 | 98.01 |
| 3 | 73984.0 | 4761 | 0.0001 | 4489.0 | 7767369 | 309.76 | 4489.0 | 72.5904 | 4489.0 | 448845.061681 | ... | 7767369 | 309.76 | 4489.0 | 72.5904 | 4489.0 | 448845.061681 | 1.366750e+13 | 320.41 | 0.214369 | 96.04 |
| 4 | 75625.0 | 5041 | 0.0001 | 4624.0 | 9078169 | 295.84 | 4624.0 | 61.9369 | 4624.0 | 4036.979723 | ... | 9078169 | 295.84 | 4624.0 | 61.9369 | 4624.0 | 4036.979723 | 8.872052e+12 | 331.24 | 0.206116 | 90.25 |
5 rows × 28 columns
print(final_df.columns)
print(final_df.shape)
Index(['Life expectancy ', 'Adult Mortality', 'infant deaths', 'Alcohol',
'Hepatitis B', 'Measles ', ' BMI ', 'Polio', 'Total expenditure',
'Diphtheria ', 'GDP', 'Population', ' thinness 1-19 years',
'Income composition of resources', 'Schooling', 'Adult Mortality^2',
'infant deaths^2', 'Alcohol^2', 'Hepatitis B^2', 'Measles ^2',
' BMI ^2', 'Polio^2', 'Total expenditure^2', 'Diphtheria ^2', 'GDP^2',
'Population^2', ' thinness 1-19 years^2',
'Income composition of resources^2', 'Schooling^2', 'Adult Mortality^2',
'infant deaths^2', 'Alcohol^2', 'Hepatitis B^2', 'Measles ^2',
' BMI ^2', 'Polio^2', 'Total expenditure^2', 'Diphtheria ^2', 'GDP^2',
'Population^2', ' thinness 1-19 years^2',
'Income composition of resources^2', 'Schooling^2',
'Adult Mortality*infant deaths', 'Alcohol*Hepatitis B',
'Measles * BMI ', 'Polio*Total expenditure', 'Diphtheria *GDP',
'Population* thinness 1-19 years',
'Income composition of resources*Schooling'],
dtype='object')
(1649, 50)
# Keep only the first occurrence of each duplicated column name
final_df = final_df.loc[:, ~final_df.columns.duplicated()]
print(final_df.columns)
final_df
Index(['Life expectancy ', 'Adult Mortality', 'infant deaths', 'Alcohol',
'Hepatitis B', 'Measles ', ' BMI ', 'Polio', 'Total expenditure',
'Diphtheria ', 'GDP', 'Population', ' thinness 1-19 years',
'Income composition of resources', 'Schooling', 'Adult Mortality^2',
'infant deaths^2', 'Alcohol^2', 'Hepatitis B^2', 'Measles ^2',
' BMI ^2', 'Polio^2', 'Total expenditure^2', 'Diphtheria ^2', 'GDP^2',
'Population^2', ' thinness 1-19 years^2',
'Income composition of resources^2', 'Schooling^2',
'Adult Mortality*infant deaths', 'Alcohol*Hepatitis B',
'Measles * BMI ', 'Polio*Total expenditure', 'Diphtheria *GDP',
'Population* thinness 1-19 years',
'Income composition of resources*Schooling'],
dtype='object')
| Life expectancy | Adult Mortality | infant deaths | Alcohol | Hepatitis B | Measles | BMI | Polio | Total expenditure | Diphtheria | ... | thinness 1-19 years^2 | Income composition of resources^2 | Schooling^2 | Adult Mortality*infant deaths | Alcohol*Hepatitis B | Measles * BMI | Polio*Total expenditure | Diphtheria *GDP | Population* thinness 1-19 years | Income composition of resources*Schooling | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 65.0 | 263.0 | 62 | 0.01 | 65.0 | 1154 | 19.1 | 6.0 | 8.16 | 65.0 | ... | 295.84 | 0.229441 | 102.01 | 16306.0 | 0.65 | 22041.4 | 48.96 | 37976.848650 | 580267696.8 | 4.8379 |
| 1 | 59.9 | 271.0 | 64 | 0.01 | 62.0 | 492 | 18.6 | 58.0 | 8.18 | 62.0 | ... | 306.25 | 0.226576 | 100.00 | 17344.0 | 0.62 | 9151.2 | 474.44 | 37987.183868 | 5732685.0 | 4.7600 |
| 2 | 59.9 | 268.0 | 66 | 0.01 | 64.0 | 430 | 18.1 | 62.0 | 8.13 | 64.0 | ... | 313.29 | 0.220900 | 98.01 | 17688.0 | 0.64 | 7783.0 | 504.06 | 40431.678464 | 561650877.6 | 4.6530 |
| 3 | 59.5 | 272.0 | 69 | 0.01 | 67.0 | 2787 | 17.6 | 67.0 | 8.52 | 67.0 | ... | 320.41 | 0.214369 | 96.04 | 18768.0 | 0.67 | 49051.2 | 570.84 | 44887.253000 | 66175548.2 | 4.5374 |
| 4 | 59.2 | 275.0 | 71 | 0.01 | 68.0 | 3013 | 17.2 | 68.0 | 7.87 | 68.0 | ... | 331.24 | 0.206116 | 90.25 | 19525.0 | 0.68 | 51823.6 | 535.16 | 4320.531708 | 54210501.8 | 4.3130 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2933 | 44.3 | 723.0 | 27 | 4.36 | 68.0 | 31 | 27.1 | 67.0 | 7.13 | 65.0 | ... | 88.36 | 0.165649 | 84.64 | 19521.0 | 296.48 | 840.1 | 477.71 | 29533.832510 | 120108603.4 | 3.7444 |
| 2934 | 44.5 | 715.0 | 26 | 4.06 | 7.0 | 998 | 26.7 | 7.0 | 6.52 | 68.0 | ... | 96.04 | 0.174724 | 90.25 | 18590.0 | 28.42 | 26646.6 | 45.64 | 30827.878554 | 123812190.6 | 3.9710 |
| 2935 | 44.8 | 73.0 | 25 | 4.43 | 73.0 | 304 | 26.3 | 73.0 | 6.53 | 71.0 | ... | 1.44 | 0.182329 | 100.00 | 1825.0 | 323.39 | 7995.2 | 476.69 | 4071.732140 | 150630.0 | 4.2700 |
| 2936 | 45.3 | 686.0 | 25 | 1.72 | 76.0 | 529 | 25.9 | 76.0 | 6.16 | 75.0 | ... | 2.56 | 0.182329 | 96.04 | 17150.0 | 130.72 | 13701.1 | 468.16 | 41144.048400 | 19785864.0 | 4.1846 |
| 2937 | 46.0 | 665.0 | 24 | 1.68 | 79.0 | 1483 | 25.5 | 78.0 | 7.10 | 78.0 | ... | 121.00 | 0.188356 | 96.04 | 15960.0 | 132.72 | 37816.5 | 553.80 | 42693.992523 | 134444761.0 | 4.2532 |
1649 rows × 36 columns
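`columns.duplicated()` marks the second and later occurrences of each name, so the boolean mask above keeps exactly one copy of the doubled squared columns. A minimal illustration on a toy frame:

```python
import pandas as pd

# A frame with a repeated column name, mimicking the doubled '^2' block
dup = pd.concat(
    [pd.DataFrame({'x': [1, 2], 'x^2': [1, 4]}),
     pd.DataFrame({'x^2': [1, 4]})],
    axis=1,
)
print(dup.columns.tolist())          # ['x', 'x^2', 'x^2']

# ~duplicated() is True only for the first occurrence of each name
deduped = dup.loc[:, ~dup.columns.duplicated()]
print(deduped.columns.tolist())      # ['x', 'x^2']
```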
# Plot each predictor (X) against y, skipping Life expectancy vs itself
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

columns_to_plot = [col for col in final_df.columns if col != 'Life expectancy ']
num_graphs = len(columns_to_plot)
ncols = 2  # fixed number of columns in the grid
nrows = int(np.ceil(num_graphs / ncols))  # rows needed for the grid
fig, axs = plt.subplots(nrows, ncols, figsize=(20, nrows * 5))
axs = axs.flatten()
for i, column in enumerate(columns_to_plot):
    sns.scatterplot(data=final_df, x=column, y='Life expectancy ', ax=axs[i])
# Remove the unused subplots
for j in range(num_graphs, len(axs)):
    fig.delaxes(axs[j])
plt.tight_layout()
plt.show()
# Applying the closed-form (normal equation) solution for multiple linear regression
X = final_df.drop(columns='Life expectancy ')
y = final_df['Life expectancy ']
X_np = X.to_numpy()
y_np = y.to_numpy()
X_np.shape, y_np.shape
((1649, 35), (1649,))
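`fit_model` is defined earlier in the notebook; judging from the returned tuple `(array, StandardScaler(), StandardScaler())` below, a plausible sketch (an assumption, not the notebook's exact code) is the normal equation applied to standardized data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def fit_model_sketch(X, y):
    """Plausible reconstruction of fit_model: standardize X and y, then solve
    the normal equation (X^T X) beta = X^T y. Not the notebook's exact code."""
    scaler_X = StandardScaler()
    Xs = scaler_X.fit_transform(X)
    scaler_y = StandardScaler()
    ys = scaler_y.fit_transform(y.reshape(-1, 1)).flatten()
    beta = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)
    return beta, scaler_X, scaler_y

# Sanity check on synthetic data with known coefficient signs
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.01, size=200)
beta_demo, _, _ = fit_model_sketch(X_demo, y_demo)
```

Because both X and y are standardized, the recovered coefficients are on the standardized scale; they keep the signs of the true coefficients but not their raw magnitudes.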
# Multiple linear regression model
beta = fit_model(X_np, y_np)
print("The coefficients are: ", beta)
The coefficients are:  (array([ 0.07276329, -0.10340446, -0.18210298, 0.20016337, 0.01268828,
0.01539157, -0.08500526, 0.03393918, -0.36242215, 0.73686881,
0.00988002, -0.37648582, -0.10607508, 0.19752976, -0.42329008,
-0.03182842, -0.05542406, -0.29576641, 0.01969459, 0.01880983,
0.05652631, -0.04241047, 0.53265629, -0.00836126, -0.05081046,
0.39384124, 1.02055566, 0.11094838, 0.02090734, 0.11747444,
0.0092962 , 0.03622271, -0.70230681, 0.05100309, -0.64385561]), StandardScaler(), StandardScaler())
validacion_cruzada(X_np, y_np, 5)
MSE = 14.274511460132596  MAE = 2.553970031570719  R^2 = 0.8008212715563825
MSE = 10.82209179044934   MAE = 2.519580199297883  R^2 = 0.853794312214851
MSE = 10.014386510878426  MAE = 2.331230425474868  R^2 = 0.8696006441716084
MSE = 11.45547159411425   MAE = 2.476862924703758  R^2 = 0.8639939852960448
MSE = 11.77069167530588   MAE = 2.407790285488747  R^2 = 0.8517197079066726
Average MSE: 11.667430606176097
Average MAE: 2.457886773307195
Average R^2: 0.847985984229112
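`validacion_cruzada` is also defined earlier in the notebook; a minimal sketch of a k-fold routine reporting the same metrics (MSE, MAE, R²), assuming ordinary least squares inside each fold and synthetic data for the demo:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def kfold_metrics(X, y, k=5, seed=0):
    """Average MSE/MAE/R^2 over k folds (a sketch, not the notebook's function)."""
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    mses, maes, r2s = [], [], []
    for train_idx, test_idx in kf.split(X):
        model = LinearRegression().fit(X[train_idx], y[train_idx])
        y_pred = model.predict(X[test_idx])
        mses.append(mean_squared_error(y[test_idx], y_pred))
        maes.append(mean_absolute_error(y[test_idx], y_pred))
        r2s.append(r2_score(y[test_idx], y_pred))
    return np.mean(mses), np.mean(maes), np.mean(r2s)

# Demo on synthetic linear data
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 4))
y_demo = X_demo @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=300)
mse_avg, mae_avg, r2_avg = kfold_metrics(X_demo, y_demo)
```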
7. Implement Ridge regression with gradient descent, and generate the Ridge plot for the original dataset (without the squared variables).¶
# Ridge regression via gradient descent for df_clean: gradient of the objective
def grad_ridge(X, y, beta, alpha, lamb):
    # Gradient of (1/n)||y - X beta||^2 + lamb * ||beta||^2
    # (alpha is the learning rate; unused here, kept for the caller's interface)
    n = len(y)
    y_pred = X @ beta
    res = y - y_pred
    tmp = res[:, np.newaxis] * X  # elementwise product, row by row
    return -2/n * tmp.sum(axis=0) + 2*lamb*beta
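The gradient above, -(2/n)·Xᵀ(y − Xβ) + 2λβ, corresponds to the objective (1/n)‖y − Xβ‖² + λ‖β‖² and can be sanity-checked against finite differences of that loss. A standalone sketch that re-implements the same formula on random data:

```python
import numpy as np

def ridge_loss(X, y, beta, lamb):
    # (1/n)||y - X beta||^2 + lamb * ||beta||^2
    res = y - X @ beta
    return (res @ res) / len(y) + lamb * (beta @ beta)

def ridge_grad(X, y, beta, lamb):
    # Same formula as grad_ridge above: -(2/n) X^T (y - X beta) + 2 lamb beta
    res = y - X @ beta
    return -2.0 / len(y) * (X.T @ res) + 2.0 * lamb * beta

rng = np.random.default_rng(2)
X_chk = rng.normal(size=(50, 3))
y_chk = rng.normal(size=50)
b_chk = rng.normal(size=3)
lamb_chk, eps = 0.1, 1e-6

# Central finite differences, one coordinate at a time
fd = np.zeros(3)
for j in range(3):
    e = np.zeros(3)
    e[j] = eps
    fd[j] = (ridge_loss(X_chk, y_chk, b_chk + e, lamb_chk)
             - ridge_loss(X_chk, y_chk, b_chk - e, lamb_chk)) / (2 * eps)

max_err = np.max(np.abs(fd - ridge_grad(X_chk, y_chk, b_chk, lamb_chk)))
```

A small `max_err` confirms the analytic gradient matches the loss it claims to differentiate.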
# Fitting Ridge regression with gradient descent
def fit_model_ridge(X, y, alpha, maxit, lamb):
    # Standardize the data
    scaler_X = StandardScaler()
    X_scaled = scaler_X.fit_transform(X)
    scaler_y = StandardScaler()
    y_scaled = scaler_y.fit_transform(y.reshape(-1, 1)).flatten()
    # Number of predictors
    n = X.shape[1]
    # Initialize beta uniformly in [-1, 1)
    beta = 2*np.random.rand(n) - 1.0
    # Gradient descent until the gradient is small or maxit is reached
    it = 0
    while (np.linalg.norm(grad_ridge(X_scaled, y_scaled, beta, alpha, lamb)) > 1e-4) and (it < maxit):
        beta = beta - alpha * grad_ridge(X_scaled, y_scaled, beta, alpha, lamb)
        # Stop if beta diverges to NaN
        if np.any(np.isnan(beta)):
            print(f"NaN detected at iteration {it}")
            break
        it = it + 1
    return beta, scaler_X, scaler_y
df_clean_ridge = df_clean.copy()
X = df_clean_ridge.drop(columns='Life expectancy ')
y = df_clean_ridge['Life expectancy ']
X_np = X.to_numpy()
y_np = y.to_numpy()
X_np.shape, y_np.shape
((1649, 14), (1649,))
beta = fit_model_ridge(X_np, y_np, alpha=0.005, maxit=10000, lamb=0.1)
print("The coefficients are: ", beta)
The coefficients are:  (array([-0.38303956, -0.02280727, -0.05172093, 0.0017243 , 0.00770393,
0.09641199, 0.02974441, -0.00401481, 0.0504755 , 0.08888828,
0.02065827, -0.05218345, 0.22319087, 0.2470206 ]), StandardScaler(), StandardScaler())
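For this objective the minimizer also has a closed form, β = (XᵀX + nλI)⁻¹ Xᵀy, so the gradient-descent result can be checked against it. A self-contained sketch on synthetic data (independent of the notebook's variables):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 4
X_toy = rng.normal(size=(n, p))
y_toy = X_toy @ np.array([1.5, -0.5, 0.0, 2.0]) + rng.normal(scale=0.1, size=n)
lamb_toy = 0.1

# Closed form for the objective (1/n)||y - X b||^2 + lamb ||b||^2
beta_closed = np.linalg.solve(X_toy.T @ X_toy + n * lamb_toy * np.eye(p),
                              X_toy.T @ y_toy)

# Plain gradient descent on the same objective
beta_gd = np.zeros(p)
step = 0.005
for _ in range(20000):
    grad = -2.0 / n * X_toy.T @ (y_toy - X_toy @ beta_gd) + 2.0 * lamb_toy * beta_gd
    beta_gd -= step * grad

gap = np.max(np.abs(beta_gd - beta_closed))
```

With a small enough step size and enough iterations, `gap` shrinks toward zero, which is a quick way to validate a hand-written Ridge optimizer.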
# Define the lambda values on a logarithmic scale
# (the variable is named `alphas`, but these are regularization strengths)
alphas = np.logspace(-2, 2, 100)
coefs = []
# Fit the model for each value of lambda
for lamb in alphas:
    beta, _, _ = fit_model_ridge(X_np, y_np, alpha=.005, maxit=1000, lamb=lamb)
    coefs.append(beta)
coefs = np.array(coefs)
# Generate the plot
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(alphas, coefs)
ax.set_xscale('log')
# Legend mapping each color to its variable, placed to the right of the plot
ax.legend(X.columns, loc='center left', bbox_to_anchor=(1, 0.5))
ax.set_xlabel('lambda (regularization)')
ax.set_ylabel('Coefficients')
ax.set_title('Model coefficients as a function of regularization (lambda)')
plt.axis('tight')
plt.show()
from sklearn.linear_model import Ridge
alphas = np.logspace(-3, 6, 100)
coefs_sklearn = []
for lamb in alphas:
    ridge = Ridge(alpha=lamb, fit_intercept=False)
    ridge.fit(X_np, y_np)
    coefs_sklearn.append(ridge.coef_)
coefs_sklearn = np.array(coefs_sklearn)
# Generate the plot
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(alphas, coefs_sklearn)
ax.set_xscale('log')
ax.legend(X.columns, loc='center left', bbox_to_anchor=(1, 0.5))
ax.set_xlabel('lambda (regularization)')
ax.set_ylabel('Coefficients')
ax.set_title('Model coefficients as a function of regularization (lambda)')
plt.axis('tight')
plt.show()
c:\Users\palmi\.conda\envs\concentracion\lib\site-packages\sklearn\linear_model\_ridge.py:216: LinAlgWarning: Ill-conditioned matrix (rcond=2.14305e-18): result may not be accurate.
  return linalg.solve(A, Xy, assume_a="pos", overwrite_a=True).T
8. Use a library to generate the Lasso plot for the original dataset (without the squared variables). Which variables are most relevant to the model?¶
from sklearn.linear_model import Lasso
coefs_lasso = []
for lamb in alphas:
    lasso = Lasso(alpha=lamb, fit_intercept=False)
    lasso.fit(X_np, y_np)
    coefs_lasso.append(lasso.coef_)
coefs_lasso = np.array(coefs_lasso)
# Generate the plot
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(alphas, coefs_lasso)
ax.set_xscale('log')
ax.legend(X.columns, loc='center left', bbox_to_anchor=(1, 0.5))
ax.set_xlabel('lambda (regularization)')
ax.set_ylabel('Coefficients')
ax.set_title('Model coefficients as a function of regularization (lambda)')
plt.axis('tight')
plt.show()
c:\Users\palmi\.conda\envs\concentracion\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:697: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.521e+04, tolerance: 8.047e+02
  model = cd_fast.enet_coordinate_descent(
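One way to read "which variables matter" off the Lasso path is to record, for each feature, the largest penalty at which its coefficient is still nonzero: features that survive longer are more relevant. A sketch on synthetic data where only features 0 and 3 drive the target:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X_sim = rng.normal(size=(300, 5))
# Only features 0 and 3 actually drive y
y_sim = 3.0 * X_sim[:, 0] + 2.0 * X_sim[:, 3] + rng.normal(scale=0.1, size=300)

X_std = StandardScaler().fit_transform(X_sim)
alpha_grid = np.logspace(-3, 1, 50)  # ascending penalties

# For each feature, record the largest alpha at which its coefficient
# remains nonzero along the path
survival = np.zeros(5)
for a in alpha_grid:
    coef = Lasso(alpha=a).fit(X_std, y_sim).coef_
    survival = np.where(np.abs(coef) > 1e-8, a, survival)

ranking = np.argsort(survival)[::-1]  # features ordered by relevance
```

On standardized inputs the same procedure applied to the notebook's `X_np`/`y_np` would give a quantitative version of the visual reading of the Lasso plot.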
9. Looking at the regression results, develop a conclusion on the following points:¶
A. Do you consider the linear regression model effective for modeling this problem's data? Why?¶
Given the behavior of the data, better models likely exist, since the relationship between most variables and the response does not tend to be linear. Only Income composition of resources and Schooling fit well under a linear regression model; the Ridge plot likewise suggests they carry the most weight in the model.
B. Do you observe significant variability in the R2, MSE, and MAE values when you apply cross-validation? Explain your answer.¶
Cross-validation ensures that every observation is used for training across the different iterations, which guards against overfitting. The different methods produced different results: LOOCV had the lowest average error, although it combined a few folds with large errors and many folds with small errors, which is why its average looks "normal". Monte Carlo cross-validation was the most consistent, keeping a similar error across iterations, so its errors follow a roughly normal distribution when plotted.
C. Which model is better for this problem's data, the linear or the quadratic one? Why?¶
A quadratic model still needs to be fitted to these data to check which of the two performs better. For now, the linear model behaved reasonably well, and with a feature selection method plus cross-validation it could become more effective at prediction.
D. Which variables are most relevant to the model according to Ridge and Lasso?¶
The most relevant variables according to Ridge are Income composition of resources and Schooling. The most relevant variables according to Lasso are Population and Diphtheria.
E. Do you find any interesting relationship between the response variable and the predictors?¶
The better a country uses its resources, the higher the response variable (Life expectancy), with a correlation of 0.73; likewise, the more schooling in a country, the higher the life expectancy, with a correlation of 0.78.
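Pairwise correlations like the 0.73 and 0.78 quoted above can be computed directly with pandas. A minimal sketch on a toy frame (the values below are illustrative, not the real data; the actual column names, e.g. 'Life expectancy' and 'Schooling', are assumed to match the CSV headers):

```python
import pandas as pd

# Toy stand-in for the real dataframe
toy = pd.DataFrame({
    "Life expectancy": [65.0, 59.9, 72.3, 80.1, 75.4],
    "Schooling": [10.1, 10.0, 12.5, 16.2, 13.8],
})
# Pearson correlation between the response and one predictor
corr = toy["Life expectancy"].corr(toy["Schooling"])
print(round(corr, 2))
```

On the real data, `df.corr(numeric_only=True)['Life expectancy']` would give the full column of correlations at once.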
Parkinson's Disease Telemonitoring¶
from ucimlrepo import fetch_ucirepo
# fetch dataset
parkinsons_telemonitoring = fetch_ucirepo(id=189)
# data (as pandas dataframes)
X = parkinsons_telemonitoring.data.features
y = parkinsons_telemonitoring.data.targets
print(X)
print(y)
print(X.shape)
print(y.shape)
# metadata
print(parkinsons_telemonitoring.metadata)
# variable information
print(parkinsons_telemonitoring.variables)
age test_time Jitter(%) Jitter(Abs) Jitter:RAP Jitter:PPQ5 \
0 72 5.6431 0.00662 0.000034 0.00401 0.00317
1 72 12.6660 0.00300 0.000017 0.00132 0.00150
2 72 19.6810 0.00481 0.000025 0.00205 0.00208
3 72 25.6470 0.00528 0.000027 0.00191 0.00264
4 72 33.6420 0.00335 0.000020 0.00093 0.00130
... ... ... ... ... ... ...
5870 61 142.7900 0.00406 0.000031 0.00167 0.00168
5871 61 149.8400 0.00297 0.000025 0.00119 0.00147
5872 61 156.8200 0.00349 0.000025 0.00152 0.00187
5873 61 163.7300 0.00281 0.000020 0.00128 0.00151
5874 61 170.7300 0.00282 0.000021 0.00135 0.00166
Jitter:DDP Shimmer Shimmer(dB) Shimmer:APQ3 Shimmer:APQ5 \
0 0.01204 0.02565 0.230 0.01438 0.01309
1 0.00395 0.02024 0.179 0.00994 0.01072
2 0.00616 0.01675 0.181 0.00734 0.00844
3 0.00573 0.02309 0.327 0.01106 0.01265
4 0.00278 0.01703 0.176 0.00679 0.00929
... ... ... ... ... ...
5870 0.00500 0.01896 0.160 0.00973 0.01133
5871 0.00358 0.02315 0.215 0.01052 0.01277
5872 0.00456 0.02499 0.244 0.01371 0.01456
5873 0.00383 0.01484 0.131 0.00693 0.00870
5874 0.00406 0.01907 0.171 0.00946 0.01154
Shimmer:APQ11 Shimmer:DDA NHR HNR RPDE DFA PPE \
0 0.01662 0.04314 0.014290 21.640 0.41888 0.54842 0.16006
1 0.01689 0.02982 0.011112 27.183 0.43493 0.56477 0.10810
2 0.01458 0.02202 0.020220 23.047 0.46222 0.54405 0.21014
3 0.01963 0.03317 0.027837 24.445 0.48730 0.57794 0.33277
4 0.01819 0.02036 0.011625 26.126 0.47188 0.56122 0.19361
... ... ... ... ... ... ... ...
5870 0.01549 0.02920 0.025137 22.369 0.64215 0.55314 0.21367
5871 0.01904 0.03157 0.011927 22.886 0.52598 0.56518 0.12621
5872 0.01877 0.04112 0.017701 25.065 0.47792 0.57888 0.14157
5873 0.01307 0.02078 0.007984 24.422 0.56865 0.56327 0.14204
5874 0.01470 0.02839 0.008172 23.259 0.58608 0.57077 0.15336
sex
0 0
1 0
2 0
3 0
4 0
... ...
5870 0
5871 0
5872 0
5873 0
5874 0
[5875 rows x 19 columns]
motor_UPDRS total_UPDRS
0 28.199 34.398
1 28.447 34.894
2 28.695 35.389
3 28.905 35.810
4 29.187 36.375
... ... ...
5870 22.485 33.485
5871 21.988 32.988
5872 21.495 32.495
5873 21.007 32.007
5874 20.513 31.513
[5875 rows x 2 columns]
(5875, 19)
(5875, 2)
{'uci_id': 189, 'name': 'Parkinsons Telemonitoring', 'repository_url': 'https://archive.ics.uci.edu/dataset/189/parkinsons+telemonitoring', 'data_url': 'https://archive.ics.uci.edu/static/public/189/data.csv', 'abstract': "Oxford Parkinson's Disease Telemonitoring Dataset", 'area': 'Health and Medicine', 'tasks': ['Regression'], 'characteristics': ['Tabular'], 'num_instances': 5875, 'num_features': 19, 'feature_types': ['Integer', 'Real'], 'demographics': ['Age', 'Sex'], 'target_col': ['motor_UPDRS', 'total_UPDRS'], 'index_col': ['subject#'], 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 2009, 'last_updated': 'Fri Nov 03 2023', 'dataset_doi': '10.24432/C5ZS3N', 'creators': ['Athanasios Tsanas', 'Max Little'], 'intro_paper': {'title': "Accurate Telemonitoring of Parkinson's Disease Progression by Noninvasive Speech Tests", 'authors': 'A. Tsanas, Max A. Little, P. McSharry, L. Ramig', 'published_in': 'IEEE Transactions on Biomedical Engineering', 'year': 2010, 'url': 'https://www.semanticscholar.org/paper/1fdf33b6d8b1bdb38866ba824c1dcaecdfb6bdd6', 'doi': None}, 'additional_info': {'summary': "This dataset is composed of a range of biomedical voice measurements from 42 people with early-stage Parkinson's disease recruited to a six-month trial of a telemonitoring device for remote symptom progression monitoring. The recordings were automatically captured in the patient's homes.\r\n\r\nColumns in the table contain subject number, subject age, subject gender, time interval from baseline recruitment date, motor UPDRS, total UPDRS, and 16 biomedical voice measures. Each row corresponds to one of 5,875 voice recording from these individuals. The main aim of the data is to predict the motor and total UPDRS scores ('motor_UPDRS' and 'total_UPDRS') from the 16 voice measures.\r\n\r\nThe data is in ASCII CSV format. The rows of the CSV file contain an instance corresponding to one voice recording. 
There are around 200 recordings per patient, the subject number of the patient is identified in the first column. For further information or to pass on comments, please contact Athanasios Tsanas (tsanasthanasis@gmail.com) or Max Little (littlem@physics.ox.ac.uk).\r\n\r\nFurther details are contained in the following reference -- if you use this dataset, please cite:\r\nAthanasios Tsanas, Max A. Little, Patrick E. McSharry, Lorraine O. Ramig (2009),\r\n'Accurate telemonitoring of Parkinson’s disease progression by non-invasive speech tests',\r\nIEEE Transactions on Biomedical Engineering (to appear).\r\n\r\nFurther details about the biomedical voice measures can be found in:\r\nMax A. Little, Patrick E. McSharry, Eric J. Hunter, Lorraine O. Ramig (2009), \r\n'Suitability of dysphonia measurements for telemonitoring of Parkinson's disease', \r\nIEEE Transactions on Biomedical Engineering, 56(4):1015-1022\r\n", 'purpose': None, 'funded_by': None, 'instances_represent': None, 'recommended_data_splits': None, 'sensitive_data': None, 'preprocessing_description': None, 'variable_info': "subject# - Integer that uniquely identifies each subject\r\nage - Subject age\r\nsex - Subject gender '0' - male, '1' - female\r\ntest_time - Time since recruitment into the trial. The integer part is the number of days since recruitment. \r\nmotor_UPDRS - Clinician's motor UPDRS score, linearly interpolated\r\ntotal_UPDRS - Clinician's total UPDRS score, linearly interpolated\r\nJitter(%),Jitter(Abs),Jitter:RAP,Jitter:PPQ5,Jitter:DDP - Several measures of variation in fundamental frequency\r\nShimmer,Shimmer(dB),Shimmer:APQ3,Shimmer:APQ5,Shimmer:APQ11,Shimmer:DDA - Several measures of variation in amplitude\r\nNHR,HNR - Two measures of ratio of noise to tonal components in the voice\r\nRPDE - A nonlinear dynamical complexity measure\r\nDFA - Signal fractal scaling exponent\r\nPPE - A nonlinear measure of fundamental frequency variation \r\n", 'citation': None}}
name role type demographic \
0 subject# ID Integer None
1 age Feature Integer Age
2 test_time Feature Continuous None
3 Jitter(%) Feature Continuous None
4 Jitter(Abs) Feature Continuous None
5 Jitter:RAP Feature Continuous None
6 Jitter:PPQ5 Feature Continuous None
7 Jitter:DDP Feature Continuous None
8 Shimmer Feature Continuous None
9 Shimmer(dB) Feature Continuous None
10 Shimmer:APQ3 Feature Continuous None
11 Shimmer:APQ5 Feature Continuous None
12 Shimmer:APQ11 Feature Continuous None
13 Shimmer:DDA Feature Continuous None
14 NHR Feature Continuous None
15 HNR Feature Continuous None
16 RPDE Feature Continuous None
17 DFA Feature Continuous None
18 PPE Feature Continuous None
19 motor_UPDRS Target Continuous None
20 total_UPDRS Target Continuous None
21 sex Feature Binary Sex
description units missing_values
0 Integer that uniquely identifies each subject None no
1 Subject age None no
2 Time since recruitment into the trial. The int... None no
3 Several measures of variation in fundamental f... None no
4 Several measures of variation in fundamental f... None no
5 Several measures of variation in fundamental f... None no
6 Several measures of variation in fundamental f... None no
7 Several measures of variation in fundamental f... None no
8 Several measures of variation in amplitude None no
9 Several measures of variation in amplitude None no
10 Several measures of variation in amplitude None no
11 Several measures of variation in amplitude None no
12 Several measures of variation in amplitude None no
13 Several measures of variation in amplitude None no
14 Two measures of ratio of noise to tonal compon... None no
15 Two measures of ratio of noise to tonal compon... None no
16 A nonlinear dynamical complexity measure None no
17 Signal fractal scaling exponent None no
18 A nonlinear measure of fundamental frequency v... None no
19 Clinician's motor UPDRS score, linearly interp... None no
20 Clinician's total UPDRS score, linearly interp... None no
21 Subject sex '0' - male, '1' - female None no
# Keep only motor_UPDRS as the target (columns= already implies axis=1)
y.drop(columns=['total_UPDRS'], inplace=True)
y
C:\Users\palmi\AppData\Local\Temp\ipykernel_54828\1835047426.py:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy y.drop(columns=['total_UPDRS'], axis=1, inplace=True)
| motor_UPDRS | |
|---|---|
| 0 | 28.199 |
| 1 | 28.447 |
| 2 | 28.695 |
| 3 | 28.905 |
| 4 | 29.187 |
| ... | ... |
| 5870 | 22.485 |
| 5871 | 21.988 |
| 5872 | 21.495 |
| 5873 | 21.007 |
| 5874 | 20.513 |
5875 rows × 1 columns
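The SettingWithCopyWarning above appears because `fetch_ucirepo` returns frames that pandas treats as slices of a larger object, so `drop(..., inplace=True)` may act on a copy. Reassigning the result of `drop` (or taking an explicit `.copy()` first) avoids it. A minimal sketch on a hypothetical one-row frame:

```python
import pandas as pd

frame = pd.DataFrame({"motor_UPDRS": [28.199], "total_UPDRS": [34.398]})
# Reassignment instead of inplace=True: no SettingWithCopyWarning,
# and the original frame is left untouched
targets = frame.drop(columns=["total_UPDRS"])
print(list(targets.columns))  # ['motor_UPDRS']
```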
X.columns
Index(['age', 'test_time', 'Jitter(%)', 'Jitter(Abs)', 'Jitter:RAP',
'Jitter:PPQ5', 'Jitter:DDP', 'Shimmer', 'Shimmer(dB)', 'Shimmer:APQ3',
'Shimmer:APQ5', 'Shimmer:APQ11', 'Shimmer:DDA', 'NHR', 'HNR', 'RPDE',
'DFA', 'PPE', 'sex'],
dtype='object')
# Drop the predictors excluded for this exercise (columns= already implies axis=1)
X.drop(columns=['test_time', 'Jitter:PPQ5', 'Shimmer:APQ3', 'NHR'], inplace=True)
X
C:\Users\palmi\AppData\Local\Temp\ipykernel_54828\493501579.py:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy X.drop(columns=['test_time','Jitter:PPQ5','Shimmer:APQ3','NHR'], axis=1, inplace=True)
| age | Jitter(%) | Jitter(Abs) | Jitter:RAP | Jitter:DDP | Shimmer | Shimmer(dB) | Shimmer:APQ5 | Shimmer:APQ11 | Shimmer:DDA | HNR | RPDE | DFA | PPE | sex | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 72 | 0.00662 | 0.000034 | 0.00401 | 0.01204 | 0.02565 | 0.230 | 0.01309 | 0.01662 | 0.04314 | 21.640 | 0.41888 | 0.54842 | 0.16006 | 0 |
| 1 | 72 | 0.00300 | 0.000017 | 0.00132 | 0.00395 | 0.02024 | 0.179 | 0.01072 | 0.01689 | 0.02982 | 27.183 | 0.43493 | 0.56477 | 0.10810 | 0 |
| 2 | 72 | 0.00481 | 0.000025 | 0.00205 | 0.00616 | 0.01675 | 0.181 | 0.00844 | 0.01458 | 0.02202 | 23.047 | 0.46222 | 0.54405 | 0.21014 | 0 |
| 3 | 72 | 0.00528 | 0.000027 | 0.00191 | 0.00573 | 0.02309 | 0.327 | 0.01265 | 0.01963 | 0.03317 | 24.445 | 0.48730 | 0.57794 | 0.33277 | 0 |
| 4 | 72 | 0.00335 | 0.000020 | 0.00093 | 0.00278 | 0.01703 | 0.176 | 0.00929 | 0.01819 | 0.02036 | 26.126 | 0.47188 | 0.56122 | 0.19361 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5870 | 61 | 0.00406 | 0.000031 | 0.00167 | 0.00500 | 0.01896 | 0.160 | 0.01133 | 0.01549 | 0.02920 | 22.369 | 0.64215 | 0.55314 | 0.21367 | 0 |
| 5871 | 61 | 0.00297 | 0.000025 | 0.00119 | 0.00358 | 0.02315 | 0.215 | 0.01277 | 0.01904 | 0.03157 | 22.886 | 0.52598 | 0.56518 | 0.12621 | 0 |
| 5872 | 61 | 0.00349 | 0.000025 | 0.00152 | 0.00456 | 0.02499 | 0.244 | 0.01456 | 0.01877 | 0.04112 | 25.065 | 0.47792 | 0.57888 | 0.14157 | 0 |
| 5873 | 61 | 0.00281 | 0.000020 | 0.00128 | 0.00383 | 0.01484 | 0.131 | 0.00870 | 0.01307 | 0.02078 | 24.422 | 0.56865 | 0.56327 | 0.14204 | 0 |
| 5874 | 61 | 0.00282 | 0.000021 | 0.00135 | 0.00406 | 0.01907 | 0.171 | 0.01154 | 0.01470 | 0.02839 | 23.259 | 0.58608 | 0.57077 | 0.15336 | 0 |
5875 rows × 15 columns
X.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5875 entries, 0 to 5874
Data columns (total 15 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   age            5875 non-null   int64  
 1   Jitter(%)      5875 non-null   float64
 2   Jitter(Abs)    5875 non-null   float64
 3   Jitter:RAP     5875 non-null   float64
 4   Jitter:DDP     5875 non-null   float64
 5   Shimmer        5875 non-null   float64
 6   Shimmer(dB)    5875 non-null   float64
 7   Shimmer:APQ5   5875 non-null   float64
 8   Shimmer:APQ11  5875 non-null   float64
 9   Shimmer:DDA    5875 non-null   float64
 10  HNR            5875 non-null   float64
 11  RPDE           5875 non-null   float64
 12  DFA            5875 non-null   float64
 13  PPE            5875 non-null   float64
 14  sex            5875 non-null   int64  
dtypes: float64(13), int64(2)
memory usage: 688.6 KB
# Plot each predictor against the "motor_UPDRS" column
import matplotlib.pyplot as plt
import seaborn as sns
# Flatten y into a one-dimensional array
y_arr = y.values.flatten()
fig, axs = plt.subplots(4, 4, figsize=(20, 20))
for i, column in enumerate(X.columns):
    if column != 'motor_UPDRS':
        sns.scatterplot(data=X, x=column, y=y_arr, ax=axs[i // 4, i % 4])
        axs[i // 4, i % 4].set_title(f'{column} vs motor_UPDRS')
# Tighten the layout so the subplots do not overlap
plt.tight_layout()
plt.show()
1. Evaluate a linear regression model with cross-validation for the variables assigned according to your student ID, using a library or framework.¶
# Fitting a multiple linear regression model with a library and running cross-validation
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
import numpy as np
# Create the model
model = LinearRegression()
# Run 5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
# Average the sign-flipped scores; note the MSE is in squared units, not a percentage
print(f"Mean squared error: {round(np.mean(-scores), 2)}")
Mean squared error: 76.88
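`cross_val_score` only handles one metric at a time; when R2, MSE, and MAE are all needed, `cross_validate` reports them in a single pass. A minimal sketch on synthetic data (the `_demo` names are hypothetical stand-ins for the notebook's `X` and `y`):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
cv_res = cross_validate(
    LinearRegression(), X_demo, y_demo, cv=5,
    scoring=("r2", "neg_mean_squared_error", "neg_mean_absolute_error"),
)
# sklearn negates error metrics so larger is always better; flip the sign back
mse_mean = -np.mean(cv_res["test_neg_mean_squared_error"])
mae_mean = -np.mean(cv_res["test_neg_mean_absolute_error"])
r2_mean = np.mean(cv_res["test_r2"])
print(f"MSE: {mse_mean:.2f}  MAE: {mae_mean:.2f}  R^2: {r2_mean:.3f}")
```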
df_X = X.copy()
df_y = y.copy()
df_X, df_y
( age Jitter(%) Jitter(Abs) Jitter:RAP Jitter:DDP Shimmer \
0 72 0.00662 0.000034 0.00401 0.01204 0.02565
1 72 0.00300 0.000017 0.00132 0.00395 0.02024
2 72 0.00481 0.000025 0.00205 0.00616 0.01675
3 72 0.00528 0.000027 0.00191 0.00573 0.02309
4 72 0.00335 0.000020 0.00093 0.00278 0.01703
... ... ... ... ... ... ...
5870 61 0.00406 0.000031 0.00167 0.00500 0.01896
5871 61 0.00297 0.000025 0.00119 0.00358 0.02315
5872 61 0.00349 0.000025 0.00152 0.00456 0.02499
5873 61 0.00281 0.000020 0.00128 0.00383 0.01484
5874 61 0.00282 0.000021 0.00135 0.00406 0.01907
Shimmer(dB) Shimmer:APQ5 Shimmer:APQ11 Shimmer:DDA HNR RPDE \
0 0.230 0.01309 0.01662 0.04314 21.640 0.41888
1 0.179 0.01072 0.01689 0.02982 27.183 0.43493
2 0.181 0.00844 0.01458 0.02202 23.047 0.46222
3 0.327 0.01265 0.01963 0.03317 24.445 0.48730
4 0.176 0.00929 0.01819 0.02036 26.126 0.47188
... ... ... ... ... ... ...
5870 0.160 0.01133 0.01549 0.02920 22.369 0.64215
5871 0.215 0.01277 0.01904 0.03157 22.886 0.52598
5872 0.244 0.01456 0.01877 0.04112 25.065 0.47792
5873 0.131 0.00870 0.01307 0.02078 24.422 0.56865
5874 0.171 0.01154 0.01470 0.02839 23.259 0.58608
DFA PPE sex
0 0.54842 0.16006 0
1 0.56477 0.10810 0
2 0.54405 0.21014 0
3 0.57794 0.33277 0
4 0.56122 0.19361 0
... ... ... ...
5870 0.55314 0.21367 0
5871 0.56518 0.12621 0
5872 0.57888 0.14157 0
5873 0.56327 0.14204 0
5874 0.57077 0.15336 0
[5875 rows x 15 columns],
motor_UPDRS
0 28.199
1 28.447
2 28.695
3 28.905
4 29.187
... ...
5870 22.485
5871 21.988
5872 21.495
5873 21.007
5874 20.513
[5875 rows x 1 columns])
2. Find the optimal number of predictors for the model using the filter method and cross-validation. Once you have the optimal number, show the selected features.¶
Filter method¶
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression, r_regression
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def filter__selection(x, y, n_features):
    print("----- Optimal selection of number of features -----")
    n_feats = n_features
    mse_nfeat = []
    mae_nfeat = []
    r2_nfeat = []
    for n_feat in n_feats:
        print('---- n features =', n_feat)
        mse_cv = []
        mae_cv = []
        r2_cv = []
        kf = KFold(n_splits=5, shuffle=True)
        for train_index, test_index in kf.split(x):
            # Training phase: rank features on the training fold only
            x_train = x[train_index, :]
            y_train = y[train_index]
            # Note: r_regression ranks by signed Pearson correlation, so a
            # strongly negative predictor scores low; f_regression ranks by
            # the magnitude of the univariate F-statistic instead
            fselection_cv = SelectKBest(r_regression, k=n_feat)
            fselection_cv.fit(x_train, y_train)
            x_train = fselection_cv.transform(x_train)
            regr_cv = LinearRegression()
            regr_cv.fit(x_train, y_train)
            # Test phase
            x_test = fselection_cv.transform(x[test_index, :])
            y_test = y[test_index]
            y_pred = regr_cv.predict(x_test)
            mse_i = mean_squared_error(y_test, y_pred)
            mse_cv.append(mse_i)
            mae_i = mean_absolute_error(y_test, y_pred)
            mae_cv.append(mae_i)
            r2_i = r2_score(y_test, y_pred)
            r2_cv.append(r2_i)
        mse = np.average(mse_cv)
        mse_nfeat.append(mse)
        mae = np.average(mae_cv)
        mae_nfeat.append(mae)
        r2 = np.average(r2_cv)
        r2_nfeat.append(r2)
        print('MSE:', mse, ' MAE:', mae, ' R^2:', r2)
    opt_index = np.argmin(mse_nfeat)
    opt_features = n_feats[opt_index]
    print("Optimal number of features: ", opt_features)
    fig, axs = plt.subplots(1, 3, tight_layout=True)
    axs[0].plot(n_feats, mse_nfeat)
    axs[0].set_xlabel("k")
    axs[0].set_ylabel("MSE")
    axs[1].plot(n_feats, mae_nfeat)
    axs[1].set_xlabel("k")
    axs[1].set_ylabel("MAE")
    axs[2].plot(n_feats, r2_nfeat)
    axs[2].set_xlabel("k")
    axs[2].set_ylabel("r^2")
    plt.show()
    return opt_features
X_np = df_X.to_numpy()
y_np = df_y.to_numpy().ravel()
# Candidate feature counts, from 1 up to all available columns
n_features = list(range(1, X_np.shape[1] + 1))
opt_features = filter__selection(X_np, y_np, n_features)
# Refit on the full data with the optimal number of features
regr = LinearRegression()
fselection = SelectKBest(r_regression, k=opt_features)
fselection.fit(X_np, y_np)
for i in range(len(fselection.get_support())):
    if fselection.get_support()[i]:
        print("Selected features: ", df_X.columns[i])
print("Selected features: ", fselection.get_feature_names_out())
x_transformed = fselection.transform(X_np)
regr.fit(x_transformed, y)
print("Model coefficients: ", regr.coef_)
print("Model intercept: ", regr.intercept_)
----- Optimal selection of number of features -----
---- n features = 1
MSE: 61.18050430804374  MAE: 6.715571167507814  R^2: 0.0732131614795893
---- n features = 2
MSE: 60.04990390736022  MAE: 6.650276404388995  R^2: 0.09067836505141291
---- n features = 3
MSE: 60.03634675591476  MAE: 6.638499316453741  R^2: 0.09042654447251224
---- n features = 4
MSE: 60.03768081675141  MAE: 6.627608507088579  R^2: 0.09083604155507537
---- n features = 5
MSE: 59.78184462585343  MAE: 6.612851752854485  R^2: 0.0942793769460738
---- n features = 6
MSE: 59.692782998713824  MAE: 6.614683440751795  R^2: 0.0957268557724256
---- n features = 7
MSE: 59.595518253759636  MAE: 6.612527029012449  R^2: 0.0971021308047634
---- n features = 8
MSE: 59.58193378859703  MAE: 6.611305188832196  R^2: 0.09763615667853402
---- n features = 9
MSE: 59.55209136688317  MAE: 6.608117735497109  R^2: 0.09813081089169173
---- n features = 10
MSE: 59.59815916027962  MAE: 6.610330835548313  R^2: 0.0965567966100527
---- n features = 11
MSE: 59.55423347560355  MAE: 6.6078113652992005  R^2: 0.09862510424410273
---- n features = 12
MSE: 58.74496657365789  MAE: 6.5591529043803884  R^2: 0.11003598093494786
---- n features = 13
MSE: 58.591843133126375  MAE: 6.53474970219856  R^2: 0.11242201103751844
---- n features = 14
MSE: 56.94510841337241  MAE: 6.380222076912431  R^2: 0.13778987405804916
---- n features = 15
MSE: 56.60644595828394  MAE: 6.345803571493344  R^2: 0.14319586760485375
Optimal number of features:  15
Selected features:  age
Selected features:  Jitter(%)
Selected features:  Jitter(Abs)
Selected features:  Jitter:RAP
Selected features:  Jitter:DDP
Selected features:  Shimmer
Selected features:  Shimmer(dB)
Selected features:  Shimmer:APQ5
Selected features:  Shimmer:APQ11
Selected features:  Shimmer:DDA
Selected features:  HNR
Selected features:  RPDE
Selected features:  DFA
Selected features:  PPE
Selected features:  sex
Selected features:  ['x0' 'x1' 'x2' 'x3' 'x4' 'x5' 'x6' 'x7' 'x8' 'x9' 'x10' 'x11' 'x12' 'x13' 'x14']
Model coefficients:  [[ 1.96519953e-01  4.36313941e+01 -6.28644125e+04 -3.69487912e+04
   1.24265425e+04  1.01730529e+02 -5.85832544e+00 -1.93038315e+02
   9.54677288e+01 -2.21953012e+01 -3.89430398e-01  7.68249643e-01
  -2.16521331e+01  1.88680948e+01 -1.21367667e+00]]
Model intercept:  [29.25034623]
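A subtlety in the filter method above: `r_regression` ranks by signed Pearson correlation, so a predictor with a strong negative correlation scores near the bottom, while `f_regression` ranks by the magnitude of the univariate F-statistic. A minimal SelectKBest sketch on synthetic data (all `_demo` names are hypothetical):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# 2 informative columns out of 6; f_regression scores each column separately
X_demo, y_demo = make_regression(n_samples=200, n_features=6, n_informative=2,
                                 noise=5.0, random_state=1)
selector = SelectKBest(f_regression, k=2).fit(X_demo, y_demo)
mask = selector.get_support()   # boolean mask over the 6 columns
kept = np.flatnonzero(mask)     # indices of the 2 highest-scoring ones
print(kept, selector.scores_.round(1))
```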
3. Repeat the previous step, but with sequential feature selection (wrapper). Report the optimal predictors found by the method.¶
Wrapper method¶
from sklearn.feature_selection import SequentialFeatureSelector

def wrapper_selection(x, y, n_features):
    print("----- Optimal selection of number of features -----")
    n_feats = n_features
    mse_nfeat = []
    mae_nfeat = []
    r2_nfeat = []
    for n_feat in n_feats:
        print('---- n features =', n_feat)
        mse_cv = []
        mae_cv = []
        r2_cv = []
        kf = KFold(n_splits=5, shuffle=True)
        for train_index, test_index in kf.split(x):
            # Training phase
            x_train = x[train_index, :]
            y_train = y[train_index]
            regr_cv = LinearRegression()
            # Bug fix: select n_feat features on this pass. The original code
            # passed n_features_to_select="auto", which ignores n_feat and
            # always keeps half the features. Note that an integer value must
            # be strictly less than the total number of columns.
            fselection_cv = SequentialFeatureSelector(regr_cv, n_features_to_select=n_feat)
            fselection_cv.fit(x_train, y_train)
            x_train = fselection_cv.transform(x_train)
            regr_cv.fit(x_train, y_train)
            # Test phase
            x_test = fselection_cv.transform(x[test_index, :])
            y_test = y[test_index]
            y_pred = regr_cv.predict(x_test)
            mse_i = mean_squared_error(y_test, y_pred)
            mse_cv.append(mse_i)
            mae_i = mean_absolute_error(y_test, y_pred)
            mae_cv.append(mae_i)
            r2_i = r2_score(y_test, y_pred)
            r2_cv.append(r2_i)
        mse = np.average(mse_cv)
        mse_nfeat.append(mse)
        mae = np.average(mae_cv)
        mae_nfeat.append(mae)
        r2 = np.average(r2_cv)
        r2_nfeat.append(r2)
        print('MSE:', mse, ' MAE:', mae, ' R^2:', r2)
    opt_index = np.argmin(mse_nfeat)
    opt_features = n_feats[opt_index]
    print("Optimal number of features: ", opt_features)
    fig, axs = plt.subplots(1, 3, tight_layout=True)
    axs[0].plot(n_feats, mse_nfeat)
    axs[0].set_xlabel("features")
    axs[0].set_ylabel("MSE")
    axs[1].plot(n_feats, mae_nfeat)
    axs[1].set_xlabel("features")
    axs[1].set_ylabel("MAE")
    axs[2].plot(n_feats, r2_nfeat)
    axs[2].set_xlabel("features")
    axs[2].set_ylabel("r^2")
    plt.show()
    return opt_features
X_np = df_X.to_numpy()
y_np = df_y.to_numpy().ravel()
# An integer n_features_to_select must be strictly less than the number of
# columns, so the candidate grid stops at X_np.shape[1] - 1
n_features = list(range(1, X_np.shape[1]))
opt_features = wrapper_selection(X_np, y_np, n_features)
# Refit with the optimal number of features
regr = LinearRegression()
fselection = SequentialFeatureSelector(regr, n_features_to_select=opt_features)
fselection.fit(X_np, y_np)
for i in range(len(fselection.get_support())):
    if fselection.get_support()[i]:
        print("Selected features: ", df_X.columns[i])
print("Selected features: ", fselection.get_feature_names_out())
x_transformed = fselection.transform(X_np)
regr.fit(x_transformed, y)
print("Model coefficients: ", regr.coef_)
print("Model intercept: ", regr.intercept_)
----- Optimal selection of number of features -----
---- n features = 1
MSE: 58.88978716814447  MAE: 6.571484471403795  R^2: 0.10818635352885524
---- n features = 2
MSE: 59.21743429929476  MAE: 6.593539850588182  R^2: 0.1038060221217199
---- n features = 3
MSE: 59.200944871322555  MAE: 6.592100852614601  R^2: 0.102902468842249
---- n features = 4
MSE: 58.947546421458775  MAE: 6.576366108346706  R^2: 0.10737203034365024
---- n features = 5
MSE: 59.03190510803139  MAE: 6.576060823940002  R^2: 0.10597611799087339
---- n features = 6
MSE: 59.342975926425844  MAE: 6.6023096852367145  R^2: 0.10041193962339645
---- n features = 7
MSE: 59.25717467924771  MAE: 6.596511954625953  R^2: 0.10293243617229615
---- n features = 8
MSE: 59.01548686547053  MAE: 6.578862021061539  R^2: 0.10598285903063653
---- n features = 9
MSE: 59.070841449164746  MAE: 6.58402013811654  R^2: 0.10464404829430826
---- n features = 10
MSE: 59.18571072823181  MAE: 6.5804134859491485  R^2: 0.10369754828531938
---- n features = 11
MSE: 58.9641485247662  MAE: 6.5660277580089454  R^2: 0.10708435844475181
---- n features = 12
MSE: 59.377202033293656  MAE: 6.6028310828011625  R^2: 0.10000501031802025
---- n features = 13
MSE: 59.08116853640333  MAE: 6.596310207040114  R^2: 0.10543911242524495
---- n features = 14
MSE: 59.16417187310041  MAE: 6.590507792371767  R^2: 0.10384311462498375
---- n features = 15
MSE: 59.08778520386924  MAE: 6.581989684312167  R^2: 0.10566778388056455
Optimal number of features:  1
Selected features:  age
Selected features:  ['x0']
Model coefficients:  [[0.25218975]]
Model intercept:  [4.95308778]
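For reference, a self-contained SequentialFeatureSelector sketch on synthetic data (the `_demo` names are hypothetical). Forward selection starts from zero features and adds the one that most improves the selector's internal CV score; with an integer `n_features_to_select`, the value must be strictly less than the number of columns:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X_demo, y_demo = make_regression(n_samples=150, n_features=6, n_informative=3,
                                 noise=1.0, random_state=2)
sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=3,
                                direction="forward")
sfs.fit(X_demo, y_demo)
print(sfs.get_support())  # boolean mask with exactly 3 True entries
```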
4. Do the same process as in step 2, but now with the recursive feature elimination method. Report the optimal predictors found by the method.¶
Recursive method¶
from sklearn.feature_selection import RFE

def recursive_selection(x, y, n_features):
    print("----- Optimal selection of number of features -----")
    n_feats = n_features
    mse_nfeat = []
    mae_nfeat = []
    r2_nfeat = []
    for n_feat in n_feats:
        print('---- n features =', n_feat)
        mse_cv = []
        mae_cv = []
        r2_cv = []
        kf = KFold(n_splits=5, shuffle=True)
        for train_index, test_index in kf.split(x):
            # Training phase: RFE drops the feature with the smallest
            # coefficient on each round, so it is sensitive to feature scale
            x_train = x[train_index, :]
            y_train = y[train_index]
            regr_cv = LinearRegression()
            fselection_cv = RFE(regr_cv, n_features_to_select=n_feat)
            fselection_cv.fit(x_train, y_train)
            x_train = fselection_cv.transform(x_train)
            regr_cv.fit(x_train, y_train)
            # Test phase
            x_test = fselection_cv.transform(x[test_index, :])
            y_test = y[test_index]
            y_pred = regr_cv.predict(x_test)
            mse_i = mean_squared_error(y_test, y_pred)
            mse_cv.append(mse_i)
            mae_i = mean_absolute_error(y_test, y_pred)
            mae_cv.append(mae_i)
            r2_i = r2_score(y_test, y_pred)
            r2_cv.append(r2_i)
        mse = np.average(mse_cv)
        mse_nfeat.append(mse)
        mae = np.average(mae_cv)
        mae_nfeat.append(mae)
        r2 = np.average(r2_cv)
        r2_nfeat.append(r2)
        print('MSE:', mse, ' MAE:', mae, ' R^2:', r2)
    opt_index = np.argmin(mse_nfeat)
    opt_features = n_feats[opt_index]
    #print("Optimal number of features: ", opt_features)
    fig, axs = plt.subplots(1, 3, tight_layout=True)
    axs[0].plot(n_feats, mse_nfeat)
    axs[0].set_xlabel("features")
    axs[0].set_ylabel("MSE")
    axs[1].plot(n_feats, mae_nfeat)
    axs[1].set_xlabel("features")
    axs[1].set_ylabel("MAE")
    axs[2].plot(n_feats, r2_nfeat)
    axs[2].set_xlabel("features")
    axs[2].set_ylabel("r^2")
    plt.show()
    return opt_features
X_np = df_X.to_numpy()
y_np = df_y.to_numpy().ravel()
# Candidate feature counts as an explicit list
n_features = list(range(1, X_np.shape[1] + 1))
opt_features = recursive_selection(X_np, y_np, n_features)
# Refit with the optimal number of features
regr = LinearRegression()
fselection = RFE(regr, n_features_to_select=opt_features)
fselection.fit(X_np, y_np)
for i in range(len(fselection.support_)):
    if fselection.support_[i]:
        # Bug fix: column names live on the DataFrame, not the NumPy array
        print("Selected features: ", df_X.columns[i])
print("Selected features: ", fselection.get_feature_names_out())
x_transformed = fselection.transform(X_np)
regr.fit(x_transformed, y)
print("Model coefficients: ", regr.coef_)
print("Model intercept: ", regr.intercept_)
----- Optimal selection of number of features -----
---- n features = 1
MSE: 65.75990526085835  MAE: 6.9504375732296255  R^2: 0.0038676653544396757
---- n features = 2
MSE: 65.7627590450476  MAE: 6.94906302811125  R^2: 0.0039726690530456745
---- n features = 3
MSE: 65.78426513791153  MAE: 6.951342021020844  R^2: 0.004229296423902973
---- n features = 4
MSE: 65.35472665270053  MAE: 6.941635647313044  R^2: 0.009310102400269171
---- n features = 5
MSE: 64.57067600937253  MAE: 6.864245886088801  R^2: 0.02122988780755184
---- n features = 6
MSE: 63.643578727597706  MAE: 6.810747505477023  R^2: 0.03620617666039041
---- n features = 7
MSE: 63.8409161379282  MAE: 6.811172910236327  R^2: 0.033227140575342526
---- n features = 8
MSE: 60.964382952828046  MAE: 6.642364744978326  R^2: 0.07638463095635088
---- n features = 9
MSE: 60.37651403624598  MAE: 6.594998752285181  R^2: 0.08555596406720942
---- n features = 10
MSE: 60.20274602360678  MAE: 6.586963824208683  R^2: 0.08749218520592013
---- n features = 11
MSE: 60.230518724743526  MAE: 6.574452857912428  R^2: 0.08688997424494208
---- n features = 12
MSE: 60.55063900367272  MAE: 6.564762286552866  R^2: 0.08188083972639268
---- n features = 13
MSE: 60.15327905270931  MAE: 6.53836688851635  R^2: 0.08951153989541014
---- n features = 14
MSE: 58.59641364938481  MAE: 6.468637466600922  R^2: 0.1118557342192068
---- n features = 15
MSE: 56.972009020696895  MAE: 6.356525952707213  R^2: 0.1367613147177548
Selected features:  age
Selected features:  Jitter(%)
Selected features:  Jitter(Abs)
Selected features:  Jitter:RAP
Selected features:  Jitter:DDP
Selected features:  Shimmer
Selected features:  Shimmer(dB)
Selected features:  Shimmer:APQ5
Selected features:  Shimmer:APQ11
Selected features:  Shimmer:DDA
Selected features:  HNR
Selected features:  RPDE
Selected features:  DFA
Selected features:  PPE
Selected features:  sex
Selected features:  ['x0' 'x1' 'x2' 'x3' 'x4' 'x5' 'x6' 'x7' 'x8' 'x9' 'x10' 'x11' 'x12' 'x13' 'x14']
Model coefficients:  [[ 1.96519953e-01  4.36313941e+01 -6.28644125e+04 -3.69487912e+04
   1.24265425e+04  1.01730529e+02 -5.85832544e+00 -1.93038315e+02
   9.54677288e+01 -2.21953012e+01 -3.89430398e-01  7.68249643e-01
  -2.16521331e+01  1.88680948e+01 -1.21367667e+00]]
Model intercept:  [29.25034623]
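A self-contained RFE sketch for reference (the `_demo` names are hypothetical). RFE refits the estimator repeatedly and drops the feature with the smallest coefficient each round, which also means unscaled predictors can distort the ranking:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

X_demo, y_demo = make_regression(n_samples=150, n_features=5, n_informative=2,
                                 noise=1.0, random_state=3)
rfe = RFE(LinearRegression(), n_features_to_select=2).fit(X_demo, y_demo)
# ranking_ assigns 1 to every kept feature; higher ranks were eliminated earlier
print(rfe.support_, rfe.ranking_)
```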
5. Repeat the previous steps, but using a nonlinear regression model such as K-nearest neighbors.¶
Filter method¶
from sklearn.neighbors import KNeighborsRegressor

def filter_selection_knn(x, y, n_features):
    print("----- Optimal selection of number of features -----")
    n_feats = n_features
    mse_nfeat = []
    mae_nfeat = []
    r2_nfeat = []
    for n_feat in n_feats:
        print('---- n features =', n_feat)
        mse_cv = []
        mae_cv = []
        r2_cv = []
        kf = KFold(n_splits=5, shuffle=True)
        for train_index, test_index in kf.split(x):
            # Training phase: rank features on the training fold only
            x_train = x[train_index, :]
            y_train = y[train_index]
            fselection_cv = SelectKBest(f_regression, k=n_feat)
            fselection_cv.fit(x_train, y_train)
            x_train = fselection_cv.transform(x_train)
            regr_cv = KNeighborsRegressor(n_neighbors=5)
            regr_cv.fit(x_train, y_train)
            # Test phase
            x_test = fselection_cv.transform(x[test_index, :])
            y_test = y[test_index]
            y_pred = regr_cv.predict(x_test)
            mse_i = mean_squared_error(y_test, y_pred)
            mse_cv.append(mse_i)
            mae_i = mean_absolute_error(y_test, y_pred)
            mae_cv.append(mae_i)
            r2_i = r2_score(y_test, y_pred)
            r2_cv.append(r2_i)
        mse = np.average(mse_cv)
        mse_nfeat.append(mse)
        mae = np.average(mae_cv)
        mae_nfeat.append(mae)
        r2 = np.average(r2_cv)
        r2_nfeat.append(r2)
        print('MSE:', mse, ' MAE:', mae, ' R^2:', r2)
    opt_index = np.argmin(mse_nfeat)
    opt_features = n_feats[opt_index]
    print("Optimal number of features: ", opt_features)
    fig, axs = plt.subplots(1, 3, tight_layout=True)
    axs[0].plot(n_feats, mse_nfeat)
    axs[0].set_xlabel("k")
    axs[0].set_ylabel("MSE")
    axs[1].plot(n_feats, mae_nfeat)
    axs[1].set_xlabel("k")
    axs[1].set_ylabel("MAE")
    axs[2].plot(n_feats, r2_nfeat)
    axs[2].set_xlabel("k")
    axs[2].set_ylabel("r^2")
    plt.show()
    return opt_features
# Apply the KNN model with the optimal number of features
X_np = df_X.to_numpy()
y_np = df_y.to_numpy().ravel()
n_features = list(range(1, X_np.shape[1] + 1))
opt_features = filter_selection_knn(X_np, y_np, n_features)
# Fit model with optimal number of features
regr = KNeighborsRegressor(n_neighbors=5)
fselection = SelectKBest(f_regression, k=opt_features)
fselection.fit(X_np, y_np)
for i in range(len(fselection.get_support())):
if fselection.get_support()[i]:
print("Selected features: ", df_X.columns[i])
print("Selected features: ", fselection.get_feature_names_out())
x_transformed = fselection.transform(X_np)
regr.fit(x_transformed, y_np)
print("Modelo ajustado con las características seleccionadas.")
----- Optimal selection of number of features -----
---- n features = 1
MSE: 47.93807842612235  MAE: 4.777798161702128  R^2: 0.2735967853129878
---- n features = 2
MSE: 24.317699499472884  MAE: 3.50677354893617  R^2: 0.6317032420804723
---- n features = 3
MSE: 23.282341369713496  MAE: 3.4210451370212764  R^2: 0.6463645401014431
---- n features = 4
MSE: 22.95872415920218  MAE: 3.379794491914894  R^2: 0.6524647820588922
---- n features = 5
MSE: 20.60775102665954  MAE: 3.174437045106383  R^2: 0.6874300594915018
---- n features = 6
MSE: 18.354891491138588  MAE: 2.9453916119148933  R^2: 0.7211186073673168
---- n features = 7
MSE: 17.097514459804323  MAE: 2.8492717004255317  R^2: 0.7410684955355642
---- n features = 8
MSE: 17.201069747154587  MAE: 2.84826245787234  R^2: 0.7392401489934868
---- n features = 9
MSE: 17.122566662749485  MAE: 2.8548182365957446  R^2: 0.740553009056006
---- n features = 10
MSE: 17.156309153872748  MAE: 2.85737290893617  R^2: 0.740310524844151
---- n features = 11
MSE: 17.238215803810792  MAE: 2.862948272340426  R^2: 0.7389036020098807
---- n features = 12
MSE: 17.189599517977598  MAE: 2.858810471489362  R^2: 0.7395496162620218
---- n features = 13
MSE: 17.287837499832648  MAE: 2.8610996493617025  R^2: 0.7383011241615893
---- n features = 14
MSE: 17.514030856480474  MAE: 2.872266338723404  R^2: 0.7342532238273332
---- n features = 15
MSE: 9.707268928648306  MAE: 2.159997249361702  R^2: 0.8530000054101086
Optimal number of features:  15
Selected features:  age
Selected features:  Jitter(%)
Selected features:  Jitter(Abs)
Selected features:  Jitter:RAP
Selected features:  Jitter:DDP
Selected features:  Shimmer
Selected features:  Shimmer(dB)
Selected features:  Shimmer:APQ5
Selected features:  Shimmer:APQ11
Selected features:  Shimmer:DDA
Selected features:  HNR
Selected features:  RPDE
Selected features:  DFA
Selected features:  PPE
Selected features:  sex
Selected features:  ['x0' 'x1' 'x2' 'x3' 'x4' 'x5' 'x6' 'x7' 'x8' 'x9' 'x10' 'x11' 'x12' 'x13' 'x14']
Modelo ajustado con las características seleccionadas.
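The ranking behind `SelectKBest(f_regression)` can be sketched by hand: each column is scored with the univariate F-statistic derived from its Pearson correlation with the target. A NumPy approximation of that score (the synthetic `X` and `y` here are illustrative, not the notebook's data):

```python
import numpy as np

def f_regression_scores(X, y):
    # Univariate F-statistic per column, as in SelectKBest(f_regression):
    # correlate each feature with y, then convert r^2 into an F score.
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum())
    )
    return r ** 2 / (1 - r ** 2) * (n - 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=100)  # only column 0 is informative
scores = f_regression_scores(X, y)
top_k = np.argsort(scores)[::-1][:1]  # indices of the k best-scoring features
print(top_k)  # column 0 should rank first
```

Because the score is univariate, a filter method can miss features that are only useful in combination, which is one reason the wrapper and recursive results differ.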
Wrapper Method¶
def wrapper_selection_knn(x, y, n_features):
print("----- Optimal selection of number of features -----")
n_feats = n_features
mse_nfeat = []
mae_nfeat = []
r2_nfeat = []
for n_feat in n_feats:
print('---- n features =', n_feat)
mse_cv = []
mae_cv = []
r2_cv = []
kf = KFold(n_splits=5, shuffle = True)
for train_index, test_index in kf.split(x):
# Training phase
x_train = x[train_index, :]
y_train = y[train_index]
regr_cv = KNeighborsRegressor(n_neighbors=5)
            fselection_cv = SequentialFeatureSelector(regr_cv, n_features_to_select=n_feat)
fselection_cv.fit(x_train, y_train)
x_train = fselection_cv.transform(x_train)
regr_cv.fit(x_train, y_train)
# Test phase
x_test = fselection_cv.transform(x[test_index, :])
y_test = y[test_index]
y_pred = regr_cv.predict(x_test)
mse_i = mean_squared_error(y_test, y_pred)
mse_cv.append(mse_i)
mae_i = mean_absolute_error(y_test, y_pred)
mae_cv.append(mae_i)
r2_i = r2_score(y_test, y_pred)
r2_cv.append(r2_i)
mse = np.average(mse_cv)
mse_nfeat.append(mse)
mae = np.average(mae_cv)
mae_nfeat.append(mae)
r2 = np.average(r2_cv)
r2_nfeat.append(r2)
print('MSE:', mse, ' MAE:', mae,' R^2:', r2)
opt_index = np.argmin(mse_nfeat)
opt_features = n_feats[opt_index]
print("Optimal number of features: ", opt_features)
fig, axs = plt.subplots(1, 3, tight_layout=True)
axs[0].plot(n_feats, mse_nfeat)
axs[0].set_xlabel("features")
axs[0].set_ylabel("MSE")
axs[1].plot(n_feats, mae_nfeat)
axs[1].set_xlabel("features")
axs[1].set_ylabel("MAE")
axs[2].plot(n_feats, r2_nfeat)
axs[2].set_xlabel("features")
axs[2].set_ylabel("r^2")
plt.show()
return opt_features
X_np = df_X.to_numpy()
y_np = df_y.to_numpy().ravel()
n_features = list(range(1, X_np.shape[1] + 1))
opt_features = wrapper_selection_knn(X_np, y_np, n_features)
# Fit model with optimal number of features
regr = KNeighborsRegressor(n_neighbors=5)
fselection = SequentialFeatureSelector(regr, n_features_to_select = opt_features)
fselection.fit(X_np, y_np)
for i in range(len(fselection.get_support())):
if fselection.get_support()[i]:
print("Selected features: ",df_X.columns[i])
print("Selected features: ", fselection.get_feature_names_out())
x_transformed = fselection.transform(X_np)
regr.fit(x_transformed, y_np)
print("Modelo ajustado con las características seleccionadas.")
----- Optimal selection of number of features -----
---- n features = 1
MSE: 59.34760860100374  MAE: 6.595670610170674  R^2: 0.1013113187598083
---- n features = 2
MSE: 59.15797838057383  MAE: 6.584643462729472  R^2: 0.10412548607825049
---- n features = 3
MSE: 59.09706099599292  MAE: 6.590790209408409  R^2: 0.10442001820627171
---- n features = 4
MSE: 59.31629703268288  MAE: 6.595676643391144  R^2: 0.1021180059838904
---- n features = 5
MSE: 59.166585309002656  MAE: 6.589141758316506  R^2: 0.10320356138321247
---- n features = 6
MSE: 59.464386251312746  MAE: 6.599897948859976  R^2: 0.09929610664992088
---- n features = 7
MSE: 59.02223446636926  MAE: 6.580355751444176  R^2: 0.10575801198118387
---- n features = 8
MSE: 59.10332742000609  MAE: 6.585280306042774  R^2: 0.10308142063033432
---- n features = 9
MSE: 59.28310716865288  MAE: 6.5910971261368925  R^2: 0.10122552373044073
---- n features = 10
MSE: 59.01720172802909  MAE: 6.579123018445931  R^2: 0.10630569664291663
---- n features = 11
MSE: 59.00308066175184  MAE: 6.582890351040604  R^2: 0.10596869743874224
---- n features = 12
MSE: 59.1398125503031  MAE: 6.581044457169879  R^2: 0.10418662003865267
---- n features = 13
MSE: 59.195302361450715  MAE: 6.584658553594556  R^2: 0.10312350771069383
---- n features = 14
MSE: 59.04620017250011  MAE: 6.580068466063436  R^2: 0.10628383317910595
---- n features = 15
MSE: 59.14954859449473  MAE: 6.582543179231756  R^2: 0.10348200920827337
Optimal number of features:  11
Selected features:  age
Selected features:  Jitter(%)
Selected features:  Jitter(Abs)
Selected features:  Jitter:RAP
Selected features:  Shimmer
Selected features:  Shimmer(dB)
Selected features:  Shimmer:APQ5
Selected features:  Shimmer:APQ11
Selected features:  Shimmer:DDA
Selected features:  DFA
Selected features:  sex
Selected features:  ['x0' 'x1' 'x2' 'x3' 'x5' 'x6' 'x7' 'x8' 'x9' 'x12' 'x14']
Modelo ajustado con las características seleccionadas.
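`SequentialFeatureSelector`'s forward mode reduces to a greedy loop: at each step, add the candidate feature that most improves a validation score. A self-contained sketch using an ordinary least-squares scorer as a stand-in for the KNN scorer above (the synthetic data are illustrative):

```python
import numpy as np

def forward_selection(X, y, n_select):
    # Greedy forward selection, the idea behind SequentialFeatureSelector:
    # at each step add the feature that most reduces validation MSE.
    n = X.shape[0]
    split = n // 2
    X_tr, X_va = X[:split], X[split:]
    y_tr, y_va = y[:split], y[split:]
    selected = []
    remaining = list(range(X.shape[1]))
    for _ in range(n_select):
        best_j, best_mse = None, np.inf
        for j in remaining:
            cols = selected + [j]
            A = np.c_[np.ones(split), X_tr[:, cols]]      # design matrix with bias
            w, *_ = np.linalg.lstsq(A, y_tr, rcond=None)  # least-squares fit
            pred = np.c_[np.ones(n - split), X_va[:, cols]] @ w
            m = np.mean((y_va - pred) ** 2)
            if m < best_mse:
                best_j, best_mse = j, m
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 2] + 1.0 * X[:, 4] + rng.normal(scale=0.1, size=200)
print(forward_selection(X, y, 2))  # should pick columns 2 and 4 first
```

The cost is one model fit per remaining candidate per step, which is why the wrapper runs above are much slower than the filter runs.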
Recursive Method¶
# Note: RFE requires an estimator exposing coef_ or feature_importances_,
# which KNN does not provide, so SelectKBest is used as the ranking step here.
def filter_selection_knn(x, y, n_features):
print("----- Optimal selection of number of features -----")
mse_nfeat = []
mae_nfeat = []
r2_nfeat = []
for n_feat in n_features:
print('---- n features =', n_feat)
mse_cv = []
mae_cv = []
r2_cv = []
kf = KFold(n_splits=5, shuffle=True)
for train_index, test_index in kf.split(x):
# Training phase
x_train = x[train_index, :]
y_train = y[train_index]
            # Feature selection using SelectKBest with f_regression
fselection_cv = SelectKBest(score_func=f_regression, k=n_feat)
x_train = fselection_cv.fit_transform(x_train, y_train)
regr_cv = KNeighborsRegressor(n_neighbors=5)
regr_cv.fit(x_train, y_train)
# Test phase
x_test = fselection_cv.transform(x[test_index, :])
y_test = y[test_index]
y_pred = regr_cv.predict(x_test)
mse_i = mean_squared_error(y_test, y_pred)
mse_cv.append(mse_i)
mae_i = mean_absolute_error(y_test, y_pred)
mae_cv.append(mae_i)
r2_i = r2_score(y_test, y_pred)
r2_cv.append(r2_i)
mse = np.average(mse_cv)
mse_nfeat.append(mse)
mae = np.average(mae_cv)
mae_nfeat.append(mae)
r2 = np.average(r2_cv)
r2_nfeat.append(r2)
print('MSE:', mse, ' MAE:', mae,' R^2:', r2)
opt_index = np.argmin(mse_nfeat)
opt_features = n_features[opt_index]
print("Optimal number of features: ", opt_features)
fig, axs = plt.subplots(1, 3, tight_layout=True)
axs[0].plot(n_features, mse_nfeat)
axs[0].set_xlabel("k")
axs[0].set_ylabel("MSE")
axs[1].plot(n_features, mae_nfeat)
axs[1].set_xlabel("k")
axs[1].set_ylabel("MAE")
axs[2].plot(n_features, r2_nfeat)
axs[2].set_xlabel("k")
axs[2].set_ylabel("r^2")
plt.show()
return opt_features
X_np = df_X.to_numpy()
y_np = df_y.to_numpy().ravel()
# Generate the list of candidate numbers of features
n_features = list(range(1, X_np.shape[1] + 1))
# Obtain the optimal number of features
opt_features = filter_selection_knn(X_np, y_np, n_features)
# Fit the model with the optimal number of features
regr = KNeighborsRegressor(n_neighbors=5)
fselection = SelectKBest(score_func=f_regression, k=opt_features)
X_selected = fselection.fit_transform(X_np, y_np)
selected_features = df_X.columns[fselection.get_support()]
print("Selected features: ", selected_features)
regr.fit(X_selected, y_np)
print("Modelo ajustado con las características seleccionadas.")
----- Optimal selection of number of features -----
---- n features = 1
MSE: 47.04745162546417  MAE: 4.654235530212765  R^2: 0.28752017226450155
---- n features = 2
MSE: 25.244382466858482  MAE: 3.5689276051063827  R^2: 0.6172346133247337
---- n features = 3
MSE: 23.45651618999973  MAE: 3.4018965548936166  R^2: 0.6451133874608497
---- n features = 4
MSE: 23.260926520008915  MAE: 3.402694491914893  R^2: 0.6473635873043969
---- n features = 5
MSE: 20.926808299485074  MAE: 3.1923297531914896  R^2: 0.6830628390320189
---- n features = 6
MSE: 19.17933259698608  MAE: 3.0205230672340426  R^2: 0.7095325962528832
---- n features = 7
MSE: 18.030087376365888  MAE: 2.9308562042553192  R^2: 0.7271313714722586
---- n features = 8
MSE: 17.333891637300628  MAE: 2.8638473429787235  R^2: 0.7367862848562116
---- n features = 9
MSE: 17.036761498613107  MAE: 2.848948493617021  R^2: 0.7414572717506436
---- n features = 10
MSE: 16.944562390053513  MAE: 2.8317771812765957  R^2: 0.743028327628474
---- n features = 11
MSE: 17.461273044445548  MAE: 2.8686014638297874  R^2: 0.7353195181542412
---- n features = 12
MSE: 17.22832822432967  MAE: 2.856582114042553  R^2: 0.7390647084209288
---- n features = 13
MSE: 17.347280308662604  MAE: 2.8609842723404255  R^2: 0.7373116935038795
---- n features = 14
MSE: 17.32118724038829  MAE: 2.8580348765957444  R^2: 0.7374328247303565
---- n features = 15
MSE: 9.66387328762914  MAE: 2.1650288714893615  R^2: 0.8537447610536585
Optimal number of features:  15
Selected features: Index(['age', 'Jitter(%)', 'Jitter(Abs)', 'Jitter:RAP', 'Jitter:DDP',
'Shimmer', 'Shimmer(dB)', 'Shimmer:APQ5', 'Shimmer:APQ11',
'Shimmer:DDA', 'HNR', 'RPDE', 'DFA', 'PPE', 'sex'],
dtype='object')
Modelo ajustado con las características seleccionadas.
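For contrast, true recursive feature elimination (RFE) repeatedly fits an estimator, ranks the features by importance, and drops the weakest. KNN exposes neither `coef_` nor `feature_importances_`, which is why `SelectKBest` was substituted above; with a linear least-squares ranker, the elimination loop looks like this (synthetic data, illustrative only):

```python
import numpy as np

def rfe_linear(X, y, n_select):
    # Recursive feature elimination: fit, rank by |coefficient| on
    # standardized features, drop the weakest, repeat.
    active = list(range(X.shape[1]))
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize so |coef| is comparable
    while len(active) > n_select:
        A = np.c_[np.ones(len(Xs)), Xs[:, active]]
        w, *_ = np.linalg.lstsq(A, y, rcond=None)
        weakest = np.argmin(np.abs(w[1:]))  # skip the bias term
        active.pop(weakest)
    return sorted(active)

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 6))
y = 4.0 * X[:, 1] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=300)
print(rfe_linear(X, y, 2))  # expected: the informative columns [1, 3]
```

Unlike the univariate filter, each elimination step re-ranks the surviving features jointly, so redundant features can be pruned even when each correlates with the target.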
6. Find at least 4 other nonlinear regression models and carry out steps 1 through 5.¶
cross_validate_model function¶
from sklearn.svm import SVR
def cross_validate_model(model, X, y, cv_folds=5, scoring='neg_mean_squared_error'):
if isinstance(model, SVR):
model = make_pipeline(StandardScaler(), model)
kf = KFold(n_splits=cv_folds, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kf, scoring=scoring)
mean_score = scores.mean()
return mean_score
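What `cross_val_score` does under the hood can be mirrored with a manual K-fold split. A NumPy-only sketch that returns the mean negative MSE, with a trivial mean-predictor standing in for the real models (`fit`/`predict` are hypothetical helpers, not part of the notebook's code):

```python
import numpy as np

def kfold_neg_mse(fit, predict, X, y, cv_folds=5, seed=42):
    # Manual K-fold cross-validation returning mean negative MSE,
    # mirroring cross_val_score(..., scoring='neg_mean_squared_error').
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))  # shuffle, as KFold(shuffle=True) does
    folds = np.array_split(idx, cv_folds)
    scores = []
    for k in range(cv_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(cv_folds) if j != k])
        model = fit(X[train], y[train])
        pred = predict(model, X[test])
        scores.append(-np.mean((y[test] - pred) ** 2))
    return np.mean(scores)

# Toy "model": predict the training mean everywhere
fit = lambda X, y: y.mean()
predict = lambda m, X: np.full(len(X), m)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = rng.normal(size=100)
score = kfold_neg_mse(fit, predict, X, y)
print(score)  # close to -1 (the negated variance of y)
```

Negating the MSE keeps the convention that higher scores are better, which is why the main loop below prints `-mean_score`.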
Filter method function¶
def filter_selection(X, y, model, k_features=5):
selector = SelectKBest(score_func=f_regression, k=k_features)
pipeline = make_pipeline(StandardScaler(), selector, model)
mean_score = cross_validate_model(pipeline, X, y, cv_folds=5, scoring='neg_mean_squared_error')
selector.fit(X, y)
selected_features = selector.get_support(indices=True)
return mean_score, selected_features
Wrapper method function¶
def wrapper_selection(X, y, model, k_features=5):
sfs = SequentialFeatureSelector(model, n_features_to_select=k_features, direction='forward')
pipeline = make_pipeline(StandardScaler(), sfs, model)
mean_score = cross_validate_model(pipeline, X, y, cv_folds=5, scoring='neg_mean_squared_error')
sfs.fit(X, y)
selected_features = sfs.get_support(indices=True)
return mean_score, selected_features
Recursive method function¶
from sklearn.neighbors import KNeighborsRegressor
def recursive_selection(X, y, model, k_features=5):
    # RFE needs an estimator that exposes coef_ or feature_importances_;
    # SVR (rbf kernel) and KNN provide neither.
    if isinstance(model, (SVR, KNeighborsRegressor)):
        raise ValueError("RFE is not compatible with the selected model. Use an estimator that exposes coef_ or feature_importances_")
    rfe = RFE(model, n_features_to_select=k_features)
    pipeline = make_pipeline(StandardScaler(), rfe, model)
    mean_score = cross_validate_model(pipeline, X, y, cv_folds=5, scoring='neg_mean_squared_error')
    rfe.fit(X, y)
    selected_features = rfe.get_support(indices=True)
    return mean_score, selected_features
"Main"¶
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.neighbors import KNeighborsRegressor
X = df_X.to_numpy()
y = df_y.to_numpy().ravel()
# Models to evaluate
models = {
    'Máquinas de Vectores de Soporte (SVR)': SVR(kernel='rbf'),
    'DecisionTreeRegressor': DecisionTreeRegressor(),
    'RandomForestRegressor': RandomForestRegressor(),
    'MLPRegressor': MLPRegressor(hidden_layer_sizes=(100,), activation='relu', max_iter=1000)
}
for name, model in models.items():
    print(f"\n\nEvaluando modelo: {name}")
    # Filter method
    print("Método Filter")
    mean_score, selected_features = filter_selection(X, y, model)
    print(f"Error cuadrático medio: {-mean_score:.3f}")
    print(f"Número de características seleccionadas: {len(selected_features)}")
    print(f"Índices de características seleccionadas: {selected_features}")
    # Wrapper method
    print("Método Wrapper")
    mean_score, selected_features = wrapper_selection(X, y, model)
    print(f"Error cuadrático medio: {-mean_score:.3f}")
    print(f"Número de características seleccionadas: {len(selected_features)}")
    print(f"Índices de características seleccionadas: {selected_features}")
    # Recursive method: only estimators with coef_ or feature_importances_
    # can be ranked by RFE (MLPRegressor exposes neither attribute)
    print("Método Recursivo")
    if isinstance(model, (SVR, KNeighborsRegressor, MLPRegressor)):
        print("El modelo no es compatible con RFE.")
    else:
        mean_score, selected_features = recursive_selection(X, y, model)
        print(f"Error cuadrático medio: {-mean_score:.3f}")
        print(f"Número de características seleccionadas: {len(selected_features)}")
        print(f"Índices de características seleccionadas: {selected_features}")
    print("\n")
Evaluando modelo: Máquinas de Vectores de Soporte (SVR)
Método Filter
Error cuadrático medio: 47.852
Número de características seleccionadas: 5
Índices de características seleccionadas: [ 0 8 10 11 13]
Método Wrapper
Error cuadrático medio: 56.060
Número de características seleccionadas: 5
Índices de características seleccionadas: [ 1 3 4 13 14]
Método Recursivo
El modelo no es compatible con RFE.


Evaluando modelo: DecisionTreeRegressor
Método Filter
Error cuadrático medio: 27.486
Número de características seleccionadas: 5
Índices de características seleccionadas: [ 0 8 10 11 13]
Método Wrapper
Error cuadrático medio: 15.180
Número de características seleccionadas: 5
Índices de características seleccionadas: [ 0 2 10 13 14]
Método Recursivo
Error cuadrático medio: 12.140
Número de características seleccionadas: 5
Índices de características seleccionadas: [ 0 2 9 12 14]


Evaluando modelo: RandomForestRegressor
Método Filter
Error cuadrático medio: 15.497
Número de características seleccionadas: 5
Índices de características seleccionadas: [ 0 8 10 11 13]
Método Wrapper
Error cuadrático medio: 39.804
Número de características seleccionadas: 5
Índices de características seleccionadas: [ 1 2 8 13 14]
Método Recursivo
Error cuadrático medio: 7.002
Número de características seleccionadas: 5
Índices de características seleccionadas: [ 0 2 9 12 14]


Evaluando modelo: MLPRegressor
Método Filter
c:\Users\palmi\.conda\envs\concentracion\lib\site-packages\sklearn\neural_network\_multilayer_perceptron.py:690: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (1000) reached and the optimization hasn't converged yet.
  warnings.warn(
Error cuadrático medio: 41.087
Número de características seleccionadas: 5
Índices de características seleccionadas: [ 0 8 10 11 13]
Método Wrapper
c:\Users\palmi\.conda\envs\concentracion\lib\site-packages\sklearn\neural_network\_multilayer_perceptron.py:697: UserWarning: Training interrupted by user.
warnings.warn("Training interrupted by user.")
7. In view of the results of this exercise, write a conclusion on the following points:¶
A. Do you consider the linear regression model adequate for these data? Why?¶
I consider that the linear regression model is not adequate for these data, because the values do not follow a linear trend but exhibit a different behavior; therefore, nonlinear models should be used.
B. Which feature selection method do you consider works well with these data? Why?¶
In the tests performed, the recursive method behaved best at correctly choosing the variables, yielding the lowest mean squared error.
C. From the feature selection process, can you identify any outstanding features? What relevant information do you observe about them?¶
Most of the selection methods chose Age, Shimmer(dB), RPDE, Shimmer:APQ11, and PPE as relevant features, which means these are the most important characteristics used by the models. Every selection method was limited to 5 features.
D. Did the nonlinear regression models work better than the linear one? Why?¶
Yes; most of them managed to achieve a lower mean squared error, since plotting the data shows that it does not follow a linear trend.
E. Can anything interesting be concluded from the results of modeling these data with regression? Argue your answer.¶
Modeling the data with regression was useful for identifying and quantifying the relationships between multiple variables and the variable to predict (the dependent variable); however, this can vary depending on the goodness of fit and on some coefficients. If violations of the assumptions or multicollinearity problems are found, a transformation of the data might be needed.
It is also interesting to compare how different models and methods affect performance; for that reason, it is important to try several models to check which one fits the data best.